---
title: "Quantitative biases analyses using Bayesian methods"
toc: true
author: "Simon Dufour"
date: 2021-08-27T21:13:14-05:00
categories: ["R", "Bayesian"]
tags: ["R", "Bias", "OpenBUGS", "Bayesian"]
---
This document was developed as part of a 5-day course on quantitative bias analyses developed with my colleague Ian Dohoo, from the University of Prince Edward Island. In my part of the course, I reviewed the Bayesian methods that can be used to adjust for the different biases (selection and misclassification biases, or an unmeasured confounder) directly in a regression model. It covers a wee bit of theory (though most of the theory was covered in the course), but mainly the R and R2OpenBUGS code for conducting such analyses. Special thanks to Henrik Stryhn, also from the University of Prince Edward Island, who reviewed a previous version of this document.
{{< figure src="sequence bias.png" >}}In this document you will find indications on how to use various R libraries and functions for conducting quantitative bias adjustment in a Bayesian framework, along with various exercises. We assume that you have some basic R skills. If you have not worked with R before, or feel a bit rusty, here are some resources to help you prepare:
Chapters 1, 2, and 4 of the CodeCademy “Learn R” course will provide a good overview of the basic concepts required for this workshop.
If you are familiar with R and want to do some further reading, Hadley Wickham’s “R for Data Science” is a great resource.
Remember, there are often many different ways to conduct a given piece of work in R. Throughout this document, we tried to stick with the simpler approaches (e.g., the shortest code, the minimal number of R libraries).
Throughout the document, you will find examples of R code along with comments. The R code always appears in grey boxes (see the following example). This is the code that you will be able to use for your own analyses. Lines preceded by a # sign are comments; R skips them when reading the code. Following each grey box of R code, a white box with the results of the analysis is presented.
For instance, this is R code where I am simply asking for the main descriptive statistics of the speed variable of the cars dataset (note that this dataset is already loaded in R).
#This is a comment. R will ignore this line
#The summary() function can be used to ask for a summary of various R objects. For a variable (here the variable speed), it shows descriptive statistics.
summary(cars$speed)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 4.0 12.0 15.0 15.4 19.0 25.0
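As an aside, the values reported by summary() can also be obtained individually with base R functions; this is just an illustration using the same built-in cars data:

```r
# Reproduce pieces of the summary() output with individual base R functions
median(cars$speed)  # the Median column
mean(cars$speed)    # the Mean column
range(cars$speed)   # the Min. and Max. columns
```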
Throughout the document, in the text I will use:
- Italics for datasets or variables. For instance, the speed variable of the dataset cars.
- Shaded boxes for R libraries (e.g., episensr) and functions (e.g., summary()).
In R, we can first load a given library and then use its functions, or we can name the library followed by :: and then the function. For instance, the two following pieces of code are equivalent:
library(ggplot2)
ggplot(data=cars, mapping=(aes(x=speed)))+
geom_histogram()
##OR##
ggplot2::ggplot(data=cars, mapping=(aes(x=speed)))+
geom_histogram()
The latter may improve reproducibility, but at the expense of longer code. Throughout the document, we will always first load the library and then run the functions, to keep the code shorter.
One last thing: when using a given function, it is not mandatory to name all the arguments, as long as they are presented in the sequence expected by that function. For instance, the ggplot() function that we used in the previous chunk of code expects to see first a dataset (data=), then a mapping attribute (mapping=) and, within that mapping attribute, an x variable (x=). We could shorten the code by omitting all of these argument names. The two following pieces of code are, therefore, equivalent:
library(ggplot2)
ggplot(data=cars, mapping=(aes(x=speed)))+
geom_histogram()
##OR##
library(ggplot2)
ggplot(cars, (aes(speed)))+
geom_histogram()
Throughout the document, however, we will use the longer code with all the arguments named. Since you are learning these new functions, it would be quite a challenge to use the shorter code right from the start. But you could certainly adopt the shorter code later on.
LET’S GET STARTED!
One of the first things that you will need for any Bayesian analysis is a way to generate and visualize a prior distribution corresponding to some scientific knowledge that you have about an unknown parameter. Generally, you will use information such as the mean and variance, or the mode and the 2.5th (or 97.5th) percentile, to find a corresponding distribution. You will use various distributions, for instance:
- Normal distribution: defined by a mean (mu) and a standard deviation (SD) or variance (tau). Some software, OpenBUGS for instance, uses the inverse of the variance (the precision, 1/tau) to specify a given Normal distribution. Notation: dNorm(mu, 1/tau)
- Uniform distribution: defined by a minimum (min) and maximum (max). Notation: dUnif(min, max)
- Beta distribution: bounded between 0 and 1, beta distributions are defined by two shape parameters (a) and (b). Notation: dBeta(a, b)
The dnorm() function can be used to generate a given Normal distribution, and the curve() function can be used to visualize the generated distribution. These functions are already part of R; there is no need to load an R package.
curve(dnorm(x, mean=2.0, sd=0.5), # I indicate mean and SD of the distribution
from=-2, to=7, # I indicate limits for the plot
main="Normal distribution with mean of 2.0 and SD of 0.5", #Adding a title
xlab = "Value", ylab = "Density") #Adding titles for axes
Figure. Density curve of a Normal distribution.
Note that a Normal distribution with a mean of zero and a very large SD provides very little information. Such a distribution would be referred to as a flat distribution (a.k.a. a vague distribution).
curve(dnorm(x, mean=0.0, sd=10000000000),
from=-100, to=100,
main="A flat Normal distribution",
xlab = "Value", ylab = "Density")
Figure. Density curve of a flat Normal distribution.
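A quick numeric check (base R, just for illustration) confirms how uninformative this prior is: the density is essentially identical, and essentially zero, across a huge range of values.

```r
# With such a large SD, the density is virtually constant over any
# realistic range of parameter values
dnorm(-100, mean = 0, sd = 10000000000)
dnorm(0, mean = 0, sd = 10000000000)
dnorm(100, mean = 0, sd = 10000000000)
```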
In the same manner, we could visualize a uniform distribution using the dunif() function. In the following example, we assume that any value between -5.0 and 5.0 is equally probable.
curve(dunif(x, min=-5.0, max=5.0),
from=-10, to=10,
main="Uniform distribution with -5.0 and 5.0 limits",
xlab = "Value", ylab = "Density")
Figure. Density curve of a Uniform distribution.
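Recall that the density of a Uniform(min, max) distribution is simply 1/(max − min) inside the limits and 0 outside; we can verify this with dunif():

```r
# Density inside the limits: 1/(5 - (-5)) = 0.1
dunif(0, min = -5.0, max = 5.0)
# Density outside the limits: 0
dunif(7, min = -5.0, max = 5.0)
```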
Finally, Beta distributions are another type of distribution that will specifically be used for parameters that are proportions (i.e., bounded between 0.0 and 1.0). The epi.betabuster() function from the epiR library can be used to define a given prior distribution based on previous knowledge. When you use the epi.betabuster() function, it creates a new R object containing different elements. Among these, one element is named shape1 and another shape2. These correspond to the a and b shape parameters of the corresponding Beta distribution.
For instance, we may know that the most likely value for the sensitivity of a given test is 0.85 and be 95% certain that it is greater than 0.75. With these values, we will be able to find the a and b shape parameters of the corresponding Beta distribution.
library(epiR)
# Sensitivity of a test as Mode=0.85, and we are 95% sure >0.75
rval <- epi.betabuster(mode=0.85, conf=0.95, greaterthan=T, x=0.75) # I create a new object named rval
rval$shape1 #View the a shape parameter in rval
## [1] 46.348
rval$shape2 #View the b shape parameter in rval
## [1] 9.002588
#plot the prior distribution
curve(dbeta(x, shape1=rval$shape1, shape2=rval$shape2), from=0, to=1,
main="Prior for test's sensitivity", xlab = "Proportion", ylab = "Density")
Figure. Density curve of a Beta distribution for a test sensitivity.
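As a sanity check on the shape parameters returned by epi.betabuster(), recall that the mode of a Beta(a, b) distribution (for a, b > 1) is (a − 1)/(a + b − 2); plugging in the values above recovers the specified mode of 0.85.

```r
# Mode of a Beta(a, b) distribution, using the shape parameters found above
a <- 46.348
b <- 9.002588
(a - 1) / (a + b - 2)  # ~0.85, the mode we asked for
```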
Note that a dBeta(1.0, 1.0) is a uniform beta distribution.
#plot the prior distribution
curve(dbeta(x, shape1=1.0, shape2=1.0), from=0, to=1,
main="A Beta(1.0, 1.0) or flat distribution", xlab = "Proportion", ylab = "Density")
Figure. Density curve of a Beta(1.0, 1.0) distribution.
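We can confirm numerically that the Beta(1.0, 1.0) density is constant (equal to 1.0) over the whole (0, 1) interval, i.e., that it is a Uniform(0, 1) distribution:

```r
# The Beta(1, 1) density is 1 everywhere on (0, 1)
dbeta(c(0.1, 0.5, 0.9), shape1 = 1.0, shape2 = 1.0)
```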
OpenBUGS is one of the most widely used Bayesian software packages. It is a bit tedious to use, though, and it requires a lot of “point and click,” which reduces reproducibility. Although OpenBUGS can be used in script mode (i.e., coding vs. point and click), it is a lot easier to operate OpenBUGS from within R. The R2OpenBUGS library provides an R interface that uses OpenBUGS to conduct Bayesian analyses. The code that you have used in R can easily be saved and re-used or modified later. The output generated through R can later be used to produce figures or results tables for publication.
Basically, to run a given model in OpenBUGS we will need three elements: the data, a model (saved as a text file), and a set of initial values for the Markov chains.
This is the easy part. For these exercises we provided the datasets as CSV files. The function read.csv() can be used to import a CSV dataset. In the R projects that you will use for the exercises, we always stored the datasets in a folder named Data. For the exercises, we will use the study dataset (cyst_bias_study.csv).
Below we imported the dataset, named it cyst, and checked the first 6 rows using the head() function.
#Upload the .csv dataset:
cyst <- read.csv(file="Data/cyst_bias_study.csv", header=TRUE, sep=";")
head(cyst)
## id cyst test know quest sw lat rand
## 1 2 NA 0 NA 1 1 0 1
## 2 5 NA 0 NA 1 0 1 0
## 3 9 NA 0 NA 1 1 1 1
## 4 10 NA 1 NA 1 0 0 1
## 5 13 NA 1 NA 0 1 0 0
## 6 15 NA 0 NA 1 1 0 1
We will focus for now on the variable test as our outcome (i.e., whether or not the individual was positive for cysticercosis; this is a version of cyst with some misclassification included). We will use quest as our main predictor (whether or not the individual scored “knowledgeable for cysticercosis”; this is a version of know with some misclassification included).
We can check how many rows (i.e., how many observations) the dataset contains using the dim() function. This function creates a small vector with two values: the first one is the number of rows, the second one is the number of columns. In the code below, I am saving that number of observations (named num_obs) for later usage.
#Create a "a" object with number of rows and columns
a <- dim(cyst)
#Save and check the first value of the "a" object
num_obs <- a[1]
num_obs
## [1] 2161
There are 2161 observations in the dataset.
For a conventional logistic regression, we could describe the likelihood function linking unknown parameters to observed data as follows:
\(test \sim dbern(P)\)
\(logit(P) = β_0 + β_1*quest\)
In other words: the outcome test follows a Bernoulli distribution (i.e., it can take the value 0 or 1) with probability P of the value being a 1. The logit of this probability is described by an intercept (\(β_0\)) and a coefficient (\(β_1\)). The latter is multiplied by the value of the predictor quest.
In this case, the values of test and quest are the observed data and the intercept (\(β_0\)) and coefficient (\(β_1\)) are the unknown parameters.
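To make the logit link concrete, base R provides qlogis() (the logit) and plogis() (its inverse). This little check is just an illustration; it is not part of the OpenBUGS model:

```r
# logit(P) = log(P / (1 - P)); its inverse is P = exp(x) / (1 + exp(x))
p <- 0.25
x <- qlogis(p)  # the logit of 0.25, i.e., log(0.25/0.75)
plogis(x)       # back-transforms to the original probability, 0.25
```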
We will have to specify a prior distribution for each unknown parameter (using the \(\sim\) sign) or to assign a specific value to it (using the \(<-\) sign). With the former, we allow for a range of values (stochastic), while the latter would be a deterministic approach.
In the preceding example, we only have an intercept (\(β_0\)) and a coefficient (\(β_1\)). Theoretically, such parameters could take values from minus infinity to plus infinity, so a Normal distribution would be one way to describe their prior distributions. Moreover, if we want to use flat priors, we could simply choose \(mu=0.0\) and \(1/tau=0.0001\) to describe these distributions (i.e., centred on 0.0 with a large variance; \(var = 10,000\)).
In the following code, I am creating R objects for mu and 1/tau (i.e., the inverse of the variance) for \(β_0\) and for \(β_1\) prior distributions.
options(scipen=999) # To avoid use of scientific notation by R
mu_int <- 0.0 #mu for the intercept
inv_var_int <- 0.0001 #Inverse variance for the intercept
mu_betaquest <- 0.0 #mu for the coefficient
inv_var_betaquest <- 0.0001 #Inverse variance for the coefficient
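As a quick check of the precision/variance bookkeeping: an inverse variance of 0.0001 corresponds to a variance of 10,000 and thus to an SD of 100.

```r
# From precision (inverse variance) back to variance and SD
inv_var <- 0.0001
variance <- 1 / inv_var    # 10,000
std_dev <- sqrt(variance)  # 100
c(variance = variance, sd = std_dev)
```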
Our Bayesian model has to be presented to OpenBUGS as a text file (i.e., a .txt document). To create this text file, we will use the paste0() function to write down the model and then the write.table() function to save it as a .txt file.
An interesting feature of the paste0() function is the possibility to include R objects in the text. For instance, you may want to have a general model that will not be modified, but change the values used to describe the priors. Within your paste0() function you can refer to these “external” values. Thus, you just have to change the value of the R object that your model is using, rather than rewriting a new model. For instance, we already created an R object called num_obs (which is just the number of observations in the dataset). We can create a short text as follows:
text1 <- paste0("for( i in 1 : 2161 )")
text1
## [1] "for( i in 1 : 2161 )"
But, if the model is to be applied later to another dataset with a different number of observations, it may be more convenient to use:
text2 <- paste0("for( i in 1 : ",num_obs," )")
text2
## [1] "for( i in 1 : 2161 )"
When using paste0(), the quoted pieces of text are reproduced literally, while anything appearing outside the quotation marks and between commas (i.e., "text", R_object, "text") is treated as an R object whose value will be inserted between the pieces of quoted text.
Some important notes:
OpenBUGS is expecting to see the word “model” followed by curly brackets { }. The curly brackets will contain the likelihood function AND the priors.
Similarly to R, any line starting with a # will be ignored (these are comments).
Similarly to R, <- is the assignment operator.
The ~ symbol means “follow a distribution.”
Similarly to R, OpenBUGS is case sensitive (i.e., ORQUEST does not equal orquest).
Loops can be used with keyword “for” followed by curly brackets.
Arrays can be indexed using square brackets [ ].
In the example below, I create an R object called model using the paste0() function and then save it as a text file using the write.table() function. We will check this text file below and then explain the code for the model.
# Specify the model:
model <- paste0("model{
#=== LIKELIHOOD ===#
for( i in 1 : ",num_obs," ) {
test[i] ~ dbern(p[i])
logit(p[i]) <- int + betaquest*quest[i]
}
#=== EXTRA CALCULATION TO GET OR ===#
ORquest <- exp(betaquest)
#=== PRIOR ===#
int ~ dnorm(",mu_int,", ",inv_var_int,") #FLAT PRIOR
betaquest ~ dnorm(",mu_betaquest,", ",inv_var_betaquest,") #FLAT PRIOR
}")
#write to temporary text file
write.table(model, file="model.txt", quote=FALSE, sep="", row.names=FALSE,
col.names=FALSE)
Here is a snapshot of the final result (i.e., the text file):
If we look more closely at the likelihood function:
for( i in 1 : 2161 ) {
test[i] ~ dbern(p[i])
logit(p[i]) <- int + betaquest*quest[i]
}
It is read by OpenBUGS as:
- For observation #1, the observed value for test (test[1]) follows a Bernoulli distribution with a probability (p[1]) of being equal to 1. This probability is linked, through the logit function, to a combination of linear predictors and is a function of the observed quest value (quest[1]).
- For observation #2, the observed value for…
- …
- For observation #2161, the observed value for…
So, on a given iteration, the software will go through the entire dataset to try to assess the potential values for our unknown parameters (int and betaquest).
In the next piece of code:
ORquest <- exp(betaquest)
We are just transforming the coefficient for the variable quest into an \(OR\) by exponentiating betaquest.
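For instance (using a plausible value for betaquest, close to what we will estimate below), a coefficient on the logit scale is turned into an OR simply by exponentiation:

```r
# A log odds ratio (logistic regression coefficient) of -0.761
# corresponds to an OR of about 0.47
exp(-0.761)
```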
The last part is simply the specification of our prior distributions for the two unknown parameters (int and betaquest).
The initial values are the values at which each Markov chain will start.
- You can specify one value for each parameter for which you have specified a distribution (~) in the model.
- If you decide to run three Markov chains, you will need three different sets of initial values.
For instance, in our preceding example we have two unknown parameters (int and betaquest). If we were to run three Markov chains, we could generate three sets of initial values for these two parameters as follows. Of course, the chosen values could be changed.
#Initializing values for 3 chains
inits <- list(
list(int=0.0, betaquest=0.0),
list(int=1.0, betaquest=1.0),
list(int=-1.0, betaquest=-1.0)
)
We now have an R object that I have called inits, and it contains the values at which each of the three Markov chains will start.
We now have the three elements that we needed: data, a model, and a set of initial values. We can now use the R2OpenBUGS library to run our Bayesian analysis. The function that we will use for that is the bugs() function. Here are the arguments that we need to specify:
- data=: the dataset.
- inits=: the list of initial values.
- parameters.to.save=: the unknown parameters to monitor.
- n.iter=: the total number of iterations per chain.
- n.burnin=: the number of initial iterations to discard (the burn in).
- n.thin=: the thinning interval. Must be a positive integer. The default is n.thin=1 (no thinning).
- n.chains=: the number of Markov chains. The default is 3.
- model.file=: the text file containing the model.

The debug=TRUE or debug=FALSE argument controls whether OpenBUGS remains open after the analysis. If FALSE (the default), OpenBUGS is closed automatically when the script has finished running; otherwise OpenBUGS remains open for further investigation. Keeping the software open will help investigate any error message. However, R will then wait for you to close OpenBUGS before continuing with any subsequent lines of code. In situations where many analyses are run sequentially, it may be advantageous to have the software close automatically after each analysis.
The last argument, DIC=TRUE or DIC=FALSE, determines whether to compute the deviance, pD, and deviance information criterion (DIC). These can sometimes be useful for model selection/comparison. We will leave it at DIC=FALSE since we will not be using these values in the workshop.
When conducting any analysis, OpenBUGS will go through these steps and will provide these associated messages (if everything is fine):
- Making sure the model in the text file is sound. Message: Model is syntactically correct
- Loading the data. Message: Data loaded
- Making sure that the model and data correspond. Message: Model compiled
- Loading the initial values. Message: Initial values loaded
- Making sure that all unknown parameters have initial values (and generating them if you have not fully specified them). Message: Model is initialized
- Starting the update (i.e., running the Markov chains as specified). Message: Model is updating
Do not worry if you appear to be stuck on this last step. Bayesian analyses are an iterative process; with a moderate-sized dataset, it may take a few seconds or minutes to complete the process. On my computer, the following model took 38 seconds to run.
With the following code, we will create an R object called bug.out. This object will contain the results of running our model, with the specified priors, using the available data and the sets of initial values that we have created. For this analysis, we have asked for 3 Markov chains, each run for 5000 iterations, and we discarded 0 iterations (so we have not yet specified a proper burn in period). Finally, we will monitor our two main unknown parameters (int and betaquest) and the computed \(OR\) for quest (ORquest).
library(R2OpenBUGS)
#Set number of iterations
niterations <- 5000
#Run the Bayesian model
bug.out <- bugs(data=cyst,
inits=inits,
parameters.to.save=c("int", "betaquest", "ORquest"),
n.iter=niterations,
n.burnin=0,
n.thin=1,
n.chains=3,
model.file="model.txt",
debug=FALSE,
DIC=FALSE)
Before jumping to the results, you should first check:
- Whether the chains have converged,
- What the burn in period should be or, if you already specified one, if it was sufficient,
- Whether you have a large enough sample of values to describe the posterior distribution with sufficient precision,
- Whether the Markov chains behaved well (mainly their auto-correlation).
There are many other diagnostic methods available for Bayesian analyses, but, for this workshop, we will mainly use basic methods relying on plots.
Diagnostics are available when you convert your model output into an MCMC object using the as.mcmc() function loaded with the mcmcplots library.
library(mcmcplots)
bug.mcmc <- as.mcmc(bug.out)
You can then use the mcmcplot() function on this newly created MCMC object.
mcmcplot(bug.mcmc, title="Diagnostic plots")
When you do that, an HTML document will be created and your web browser will automatically open it. Here is a snapshot of the HTML file in my browser:
To check whether convergence of the Markov chains was obtained, you can inspect the trace plot presenting the values that were picked by each chain at each iteration. The different chains that you used (three in the preceding example) will be overlaid on the trace plot. You should inspect the trace plot to evaluate whether the different chains all converged in the same area. If so, after a number of iterations they should be moving in a relatively restricted area, and this area should be the same for all chains.
Here is an example of a trace plot of three “nice” Markov chains (from our preceding example, lucky us!).
In this preceding example, we can see that the chains started from different initial values, but they very quickly converged (after fewer than 100 iterations) toward the same area. They were then moving freely within this limited area (thus sampling from the posterior distribution). We could conclude based on this trace plot that:
- The Markov chains did converge
- A short burn in period of fewer than 100 iterations would be sufficient. Note that we will often use a longer burn in (e.g., 1000) just to be sure, and because it only costs a few more seconds on your computer…
Here is another example where two chains (the red and green) converged in the same area, while a third one (the blue) also converged, but in a different area. We should be worried in such a case.
Finally, here is a last example with just one Markov chain, and it clearly did not converge, even after 25,000 iterations. You should be terrified by that!!! ;-).
Next, you can inspect the posterior distributions to determine whether you have a large enough sample of independent values to describe with precision the median estimate and the 2.5th and 97.5th percentiles. The effective sample size (ESS) takes into account the number of values generated by the Markov chains AND whether these values were autocorrelated, to compute the number of “effective” independent values that are used to describe the posterior distribution. Chains that are autocorrelated will generate a smaller number of “effective values” than chains that are not autocorrelated.
The effectiveSize() function of the coda library provides a way to appraise whether you had enough iterations. In the current example, we already created an MCMC object named bug.mcmc. We can ask for the effective sample size as follows:
effectiveSize(bug.mcmc)
## int betaquest ORquest
## 11009.48 10433.54 10675.81
You could, for example, decide on an arbitrary rule for ESS (say 10,000 values) and then adapt the length of the chain to achieve such an ESS (for each of the selected parameters, because the effective sample sizes will not be the same for all parameters).
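For intuition on how autocorrelation shrinks the ESS: if a chain behaved like an AR(1) process with lag-1 autocorrelation rho, the ESS of N draws would be roughly N(1 − rho)/(1 + rho). This is a simplifying assumption for illustration only; effectiveSize() estimates the autocorrelation structure from the chain itself rather than using this formula.

```r
# Approximate ESS of N draws under an AR(1) assumption with
# lag-1 autocorrelation rho (a simplification, for intuition only)
ess_ar1 <- function(N, rho) N * (1 - rho) / (1 + rho)
ess_ar1(N = 12000, rho = 0)    # independent draws: ESS equals N
ess_ar1(N = 12000, rho = 0.5)  # autocorrelated draws: ESS is much smaller
```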
Remember, a feature of Markov chains is to have some autocorrelation between immediately subsequent iterations, and then a correlation that quickly goes down to zero (i.e., they have a short memory). The autocorrelation plot can be used to assess whether this feature is respected.
From our previous example, it seems that it is the case.
Here is another example that should be worrying.
When such a behaviour is observed, running the chains for more iterations will achieve the desired ESS. In some situations, specifying different priors (i.e., increasing the amount of information in the priors) may also help.
Now, if you are happy with the behaviour of your Markov chains and you have determined what would be an appropriate number of iterations and burn in period, you are left with two options:
- Re-run the analysis with the R2OpenBUGS bugs() function, while adjusting the burn in (n.burnin=) and, if needed, the number of iterations (n.iter=). For publication, THIS IS THE ABSOLUTE BEST OPTION, because you will then have the exact ESS, trace plots, etc.
- Remove the burn in period retrospectively from the already available output, without running R2OpenBUGS again.

For a final analysis (e.g., for publication), you will re-run the bugs() function and indicate a burn in period (e.g., n.burnin=1000). If needed, you could also increase the number of iterations (n.iter=) if the ESS was too small. For instance, in the code below, I kept the number of iterations unchanged (n=5,000), but indicated a burn in period of 1,000 iterations:
library(R2OpenBUGS)
#Set number of iterations
niterations <- 5000
nburnin <- 1000
#Run the Bayesian model
bug.out <- bugs(data=cyst,
inits=inits,
parameters.to.save=c("int", "betaquest", "ORquest"),
n.iter=niterations,
n.burnin=nburnin,
n.thin=1,
n.chains=3,
model.file="model.txt",
debug=FALSE,
DIC=FALSE)
When you have specified the burn in period directly in the bugs() function, you can summarize your results (i.e., report the median, 2.5th and 97.5th percentiles) very easily with the print() function. Below, I have also indicated digits.summary=3 to adjust the precision of the estimates that are reported (the default is 1).
print(bug.out, digits.summary=3)
## Inference for Bugs model at "model.txt",
## Current: 3 chains, each with 5000 iterations (first 1000 discarded)
## Cumulative: n.sims = 12000 iterations saved
## mean sd 2.5% 25% 50% 75% 97.5% Rhat n.eff
## int -1.363 0.084 -1.527 -1.420 -1.362 -1.306 -1.201 1.001 12000
## betaquest -0.761 0.124 -1.005 -0.845 -0.761 -0.677 -0.519 1.001 12000
## ORquest 0.471 0.058 0.366 0.430 0.467 0.508 0.595 1.001 12000
##
## For each parameter, n.eff is a crude measure of effective sample size,
## and Rhat is the potential scale reduction factor (at convergence, Rhat=1).
Here you see that the median OR estimate (95% credible interval) was 0.47 (0.37, 0.60).
Finally, we could also plot the prior and posterior distributions on a same figure (rather than just reporting the median and 95 credible interval). Below, I will create a plot with the prior and posterior distributions for the estimated coefficient of the quest variable.
First, if you inspect the bug.out object that was created using R2OpenBUGS, you will notice a sims.list element made of 3 lists (one for each of the 3 parameters that were monitored: int, betaquest, and ORquest). Each list is made of 12,000 values, which are the values from the three chains assembled together (5000 iterations minus a burn in of 1000, times 3 chains).
For plotting these values, I can use the plot() function. More specifically, I asked for a density curve (density()) of my estimate (in our case, the betaquest list in the sims.list element of the bug.out object). The $ signs indicate, successively, the element of the object and then the list within that element.
Then, I can add to this plot my prior distribution for that same parameter. Remember, for OpenBUGS, I specified that this parameter followed a Normal distribution with mean=mu_betaquest (0.0) and inverse variance=inv_var_betaquest (0.0001). The curve() function, however, requires the mean and standard deviation (SD) to plot a given Normal distribution. Therefore, I first computed this standard deviation (which is the square root of the variance or, in this case, the square root of the inverse of the inverse variance) and then asked for a Normal curve with this mean and SD.
The lty= option indicates the line type (2 is for dashed) and the add=TRUE argument simply indicates that both curves should be put on the same plot.
#Computing the SD for my prior distribution
std <- sqrt(1/inv_var_betaquest)
#Plotting the posterior distribution
plot(density(x=bug.out$sims.list$betaquest),
main="Quest coefficient",
xlab="Value", ylab="Density",
)
#Adding the prior distribution
curve(dnorm(x,
mean=(mu_betaquest),
sd=std
),
lty=2,
col="red",
add=TRUE
)
Figure. Prior (dashed red line) and posterior (full black line) distribution of the coefficient for the variable quest.
In this example, since the prior was a vague distribution, the prior distribution is not obvious on the figure; it is the nearly flat dashed red line at the bottom of the figure.
Throughout the exercises, I will most of the time use the second option (removing the burn in retrospectively), but remember this is just for preliminary analyses. When you are ready for publication, run the R2OpenBUGS script with the appropriate number of iterations and burn in period.
Nonetheless, if we want to use the already available output, we could use the seq() function, a basic R function, to specify the first iteration to keep (using from=) and the last iteration to keep (using to=). The by=1 argument indicates that every value is to be retained. For instance, by=2 would indicate to keep every other value. As an example, with the code below, I am creating an R object named k which is simply a sequence of values from 1001 to 5000 by increments of 1. I will later combine this sequence with my bug.out object to identify the values from my Markov chains that are to be kept.
# Set burn-in to 1000
burnin <- 1000
# Create a sequence of values between 1001 and 5000
k <- seq(from=(burnin+1), to=niterations, by=1)
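We can quickly confirm that k behaves as intended (writing the same numbers out literally): it contains the 4,000 iteration numbers from 1,001 to 5,000.

```r
# k should hold the 4000 iteration numbers 1001, 1002, ..., 5000
k <- seq(from = 1001, to = 5000, by = 1)
length(k)  # number of retained iterations
range(k)   # first and last retained iterations
```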
Now, if you inspect the bug.out object that was created earlier using R2OpenBUGS, you will notice an element named sims.array. As you can see below, this element is made of 3 times 3 lists of 5000 values. This corresponds to the 5000 values sampled (1 per iteration) by each Markov chain (3 chains) for each of the 3 parameters that were monitored (int, betaquest, and ORquest). Within this element, the chains have not been assembled together yet.
Figure. sims.array element of the bug.out R2OpenBUGS output.
We can combine these values using the data.frame() and rbind() functions, together with the sequence of iterations k that I created, to organize a dataset where all the values from the three Markov chains for a given parameter are in the same column, and where the values belonging to the burn in period are now excluded, retrospectively.
# Combine chains post-burn in
estimates_conv_logistic <- data.frame(rbind(bug.out$sims.array[k,1,],bug.out$sims.array[k,2,],bug.out$sims.array[k,3,]))
dim(estimates_conv_logistic)
## [1] 12000 3
Here we can see that it is a dataset with 12,000 rows (3 Markov chains times 4000 iterations [5000 iterations minus the burn in of 1000 iterations]) and 3 columns. Below are the first 6 observations from this dataset.
head(estimates_conv_logistic)
## int betaquest ORquest
## 1 -1.392 -0.8091 0.4453
## 2 -1.319 -0.7896 0.4540
## 3 -1.391 -0.7062 0.4935
## 4 -1.369 -0.7952 0.4515
## 5 -1.177 -0.8776 0.4158
## 6 -1.387 -0.7423 0.4760
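The stacking of the chains with rbind() can be illustrated with a small toy array mimicking the sims.array structure (the dimensions and values here are made up for illustration): 4 iterations, 2 chains, and 3 monitored parameters.

```r
# Toy stand-in for sims.array: 4 iterations x 2 chains x 3 parameters
sims <- array(seq_len(4 * 2 * 3), dim = c(4, 2, 3))
# Stack chain 1 on top of chain 2, one column per parameter
stacked <- data.frame(rbind(sims[, 1, ], sims[, 2, ]))
dim(stacked)  # 8 rows (4 iterations x 2 chains) and 3 columns
```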
All that is left to do now is to compute the median and the 2.5th and 97.5th percentiles for the parameters of interest. For that, we can use the apply() function. We need to indicate:
- the dataset to use (X=),
- whether it is the rows (MARGIN=1), or columns (MARGIN=2) for which descriptive statistics must be computed (in this case we wish to obtain descriptive statistic for each column),
- the actual descriptive statistic that we need (FUN=; in this case we want the quantiles, FUN=quantile),
- and, finally, arguments that are specific to the chosen descriptive statistic; here the percentiles that we would like to have reported (probs=c(0.5, 0.025, 0.975)).
# medians and 95% credible intervals
a <- apply(X=estimates_conv_logistic, MARGIN=2, FUN=quantile, probs=c(0.5, 0.025, 0.975))
a
## int betaquest ORquest
## 50% -1.362 -0.761150 0.4671000
## 2.5% -1.527 -1.005025 0.3659000
## 97.5% -1.201 -0.518895 0.5952025
Since it is not super nice to have the parameters as columns and the statistics as rows, we could transpose this table using the t() function.
b <- t(a)
b
## 50% 2.5% 97.5%
## int -1.36200 -1.527000 -1.2010000
## betaquest -0.76115 -1.005025 -0.5188950
## ORquest 0.46710 0.365900 0.5952025
Finally, rounding off a bit never hurts. The round() function can be used for that.
res_conv_logistic <- round(b, digits=3)
res_conv_logistic
## 50% 2.5% 97.5%
## int -1.362 -1.527 -1.201
## betaquest -0.761 -1.005 -0.519
## ORquest 0.467 0.366 0.595
Of course, all of this can be done in one single step.
res_conv_logistic <- round(
t(
apply(X=estimates_conv_logistic, MARGIN=2, FUN=quantile, probs=c(0.5, 0.025, 0.975))
),
digits=3)
Finally, we could also plot the prior and posterior distributions on the same figure, but this time using the variable betaquest from the data frame estimates_conv_logistic (i.e., estimates_conv_logistic$betaquest).
std <- sqrt(1/inv_var_betaquest)
plot(density(x=estimates_conv_logistic$betaquest),
main="Quest coefficient",
xlab="Value", ylab="Density",
)
curve(dnorm(x,
mean=(mu_betaquest),
sd=std
),
lty=2,
col="red",
add=TRUE
)
Figure. Prior (dashed red line) and posterior (solid black line) distributions of the coefficient for the variable quest.
Folder: Exercise 1 - Intro Bayesian
Open up the R project for Exercise 1 (i.e., the R project file named Exercise 1 - Intro Bayesian.Rproj).
In the Question 1 folder, you will find a partially completed R script named Question 1.R. To answer the following questions, try to complete the missing parts (they are highlighted by a #TO COMPLETE# comment). We also provided complete R scripts, but try to work it out on your own first!
This exercise is based on the study data set (cyst_bias_study.csv; n=2,161) with imperfect measurement of both cysticercosis (test) and knowledge (quest). For Question 4, we also provide a smaller version of that data set, named cyst_bias_study_reduced.csv, in which we randomly picked 60 observations.
Start from the partially completed Question 1.R R script located in the Question 1 folder. Use this script as a starting point to compute the crude association between quest and test using a logistic regression model estimated with a Bayesian analysis. Initially, use flat priors for your parameters. Make sure the Markov chains’ behaviours are acceptable (i.e., ESS, autocorrelation plots) and that convergence is obtained (i.e., trace plots). How do these results compare to those of the unadjusted \(OR\) computed using a frequentist approach (i.e., OR=0.47; 95%CI=0.37, 0.60)?
Start from the R script that you completed in Question 1. Run the same Bayesian logistic model, but this time use a prior distribution for the quest coefficient that gives equal probability, on the log odds scale, to odds ratios ranging from 0.05 to 20. With such a distribution you are explicitly saying that, a priori, you do not think the odds ratio could be very extreme, but that all intermediate values are equally plausible. Hint: only the prior distribution needs to be modified, in two places: 1) when creating your R objects and 2) in the model’s text file. How are your results modified?
Again, start from the R script that you completed for Question 1 or Question 2. Let’s go a bit crazy. For instance, you could believe that knowledge does not prevent the disease, but rather increases the odds of cysticercosis. Let’s say you are very convinced, a priori, that the most probable value for the \(OR\) is 2.0, with relatively little variance in that distribution. Pick a distribution that represents that prior knowledge. How are your results modified?
If you have the time… Again, you could modify some of your preceding R scripts. To appraise the effect of priors vs. sample size, run similar analyses, but this time use the cyst_bias_study_reduced.csv data set, which has only 60 observations (compared to 2,161 for the complete data set). Run a logistic regression model of the effect of quest on test, using vague prior distributions. Then run the same model, but with the prior distributions that you used in Question 3. Is the effect of the informative priors more important now?
When dealing with an unmeasured confounder, two different scenarios can arise. First, the unmeasured confounder can act solely as a confounder without modifying the effect of the exposure on the disease (i.e., no effect measure modification; EMM). In the figure below, the schema on the left illustrates this first scenario. In other situations, the confounder confounds the exposure-disease association, but is also an EMM of interest, as illustrated by the schema on the right side of the figure. The Bayesian model that we will use to deal with an unmeasured confounder will vary as a function of the situation.
Figure. DAG illustrating two types of confounding.
When we are simply dealing with an unmeasured confounder that is not expected to be an EMM, an easy option is to run a conventional logistic model using a Bayesian framework, and to simply apply the adjustment described by Lash et al. (2009).1 within the estimation process.
With:
- \(OR(biased)\) being the biased OR that could be computed using a conventional logistic model;
- \(OR(ud)\) being the hypothesized OR between confounder and disease;
- \(PU(exp)\) being the hypothesized prevalence of the confounder in the exposed;
- \(PU(unexp)\) being the hypothesized prevalence of the confounder in the unexposed.
Then, we can compute an adjusted OR (\(OR(adj)\)) describing the relation between exposure and disease after adjusting for the unmeasured confounder as:
\(OR(adj) = OR(biased) * \frac{OR(ud)*PU(unexp)+1-PU(unexp)}{OR(ud)*PU(exp)+1-PU(exp)}\)
We could thus, simply:
1. Provide prior distributions for the intercept and the coefficient of interest, but also for the \(OR(ud)\), \(PU(exp)\), and \(PU(unexp)\). Actually, for \(OR(ud)\), it will be easier to provide a prior for the log of \(OR(ud)\), due to the asymmetry of the OR scale. NOTE THAT: instead of providing prior distributions for the log of \(OR(ud)\), \(PU(exp)\), and \(PU(unexp)\), you could also specify an exact value for each (e.g., PU(exp) = 0.30). The adjustment would then be deterministic, rather than stochastic, for the bias parameters;
2. Develop a conventional logistic Bayesian model;
3. Include the above equation to compute the \(OR(adj)\).
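Before setting up the Bayesian model, it can help to see the adjustment in isolation. The base R sketch below applies Lash's equation deterministically; every number in it is an illustrative assumption, not an estimate from the course data.

```r
# Deterministic Lash adjustment for an unmeasured confounder
# (all values below are illustrative assumptions)
OR_biased <- 0.47  # biased exposure-disease OR from a conventional model
OR_ud     <- 5.0   # hypothesized confounder-disease OR
PU_exp    <- 0.60  # hypothesized prevalence of the confounder in the exposed
PU_unexp  <- 0.20  # hypothesized prevalence of the confounder in the unexposed

OR_adj <- OR_biased * (OR_ud * PU_unexp + 1 - PU_unexp) /
                      (OR_ud * PU_exp   + 1 - PU_exp)
round(OR_adj, 3)  # 0.249
```

With these particular values, the adjusted OR (about 0.25) is farther from the null than the biased OR of 0.47: a strong confounder that is more prevalent in the exposed had masked part of the protective effect.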
If we use uniform vague distributions for the intercept and the coefficient of interest:
options(scipen=999) # To avoid use of scientific notation by R
mu_int <- 0.0 #mu for the intercept
inv_var_int <- 0.0001 #Inverse variance for the intercept
mu_betaquest <- 0.0 #mu for the coefficient
inv_var_betaquest <- 0.0001 #Inverse variance for the coefficient
We could use a normal distribution for the log of \(OR(ud)\):
options(scipen=999)
mu_log_OR_ud <- 1.6 #Mean association (on log odds scale) between confounder and disease (corresponding to an OR of 4.95)
inv_var_log_OR_ud <- 5 #Inverse variance for the association (on log odds scale) between confounder and disease (corresponding to a variance of 0.2)
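Note that BUGS parameterizes the normal distribution with a precision (inverse variance), not a variance or standard deviation. A quick base R check confirms the correspondences quoted in the comments above:

```r
mu_log_OR_ud      <- 1.6  # prior mean on the log odds scale
inv_var_log_OR_ud <- 5    # prior precision (inverse variance)

exp(mu_log_OR_ud)            # back-transformed OR, about 4.95
1 / inv_var_log_OR_ud        # implied variance, 0.2
sqrt(1 / inv_var_log_OR_ud)  # implied standard deviation, about 0.45
```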
We could use beta distributions for \(PU(exp)\) and \(PU(unexp)\), since these are probabilities (i.e., we would like to bound them between 0.0 and 1.0). Here, the epi.betabuster() function from the epiR library can help. For instance, if we believe that the mode for \(PU(exp)\) is 0.60 and that we are 95% certain that it is above 0.50, we could find a corresponding beta distribution for \(PU(exp)\) as follows:
library(epiR)
rval1 <- epi.betabuster(mode=0.60, conf=0.95, greaterthan=T, x=0.50) # I create a new object named rval1
rval1$shape1 #View the a shape parameter
## [1] 42.009
rval1$shape2 ##View the b shape parameter
## [1] 28.33933
#plot the prior distribution
curve(dbeta(x, shape1=rval1$shape1, shape2=rval1$shape2), from=0, to=1,
main="Prior for PU(exp)", xlab = "Proportion", ylab = "Density")
If we believe that the mode for \(PU(unexp)\) is 0.20 and that we are 95% certain that it is below 0.30, we could find a corresponding beta distribution for \(PU(unexp)\) as follows:
library(epiR)
rval2 <- epi.betabuster(mode=0.20, conf=0.95, greaterthan=F, x=0.30)
rval2$shape1 #View the a shape parameter
## [1] 12.821
rval2$shape2 ##View the b shape parameter
## [1] 48.284
#plot the prior distribution
curve(dbeta(x, shape1=rval2$shape1, shape2=rval2$shape2), from=0, to=1,
main="Prior for PU(unexp)", xlab = "Proportion", ylab = "Density")
With these priors being determined, the following code would create the text file for the model:
# Specify the model:
model_unm_conf_lash <- paste0("model{
#=== LIKELIHOOD ===#
for( i in 1 : ",num_obs," ) {
test[i] ~ dbern(p[i])
logit(p[i]) <- int + betaquest*quest[i]
}
#=== EXTRA CALCULATION TO GET OR_biased and OR_adj ===#
OR_biased <- exp(betaquest)
OR_adj <- OR_biased*( (OR_ud*PU_unexp+(1-PU_unexp))/(OR_ud*PU_exp+(1-PU_exp)))
#=== PRIORS ===#
int ~ dnorm(",mu_int,", ",inv_var_int,") #FLAT PRIOR
betaquest ~ dnorm(",mu_betaquest,", ",inv_var_betaquest,") #FLAT PRIOR
log_OR_ud ~ dnorm(", mu_log_OR_ud,"," , inv_var_log_OR_ud,") #PRIOR FOR log_OR_ud
OR_ud <- exp(log_OR_ud) #LINK OR_ud TO THE LOG_OR_UD
PU_exp ~ dbeta(", rval1$shape1,",", rval1$shape2, ") #PRIOR FOR PREVALENCE OF THE CONFOUNDER IN THE EXPOSED INDIVIDUALS
PU_unexp ~ dbeta(", rval2$shape1,",", rval2$shape2, ") #PRIOR FOR PREVALENCE OF THE CONFOUNDER IN THE UNEXPOSED INDIVIDUALS
}")
#write to temporary text file
write.table(model_unm_conf_lash, file="model_unm_conf_lash.txt", quote=FALSE, sep="", row.names=FALSE,
col.names=FALSE)
This code would create the following .txt file:
Note that we could provide initial values for all the unknown parameters. So, not just int and betaquest, but also log_OR_ud, PU_exp, and PU_unexp. For instance:
inits <- list(
list(int=0.0, betaquest=0.0, log_OR_ud=0, PU_exp=0.20, PU_unexp=0.20),
list(int=1.0, betaquest=1.0, log_OR_ud=1.0, PU_exp=0.30, PU_unexp=0.30),
list(int=-1.0, betaquest=-1.0, log_OR_ud=-1.0, PU_exp=0.10, PU_unexp=0.10)
)
When running the Bayesian analysis, you will have to indicate the parameters that you want to monitor:
library(R2OpenBUGS)
#Set number of iterations
niterations <- 5000
#Run the Bayesian model
bug.out.vague <- bugs(data=cyst,
inits=inits,
parameters.to.save=c("int", "betaquest", "OR_biased", "OR_adj"), #I CHOSE TO MONITOR int, betaquest, OR_biased, AND OR_adj
n.iter=niterations,
n.burnin=0,
n.thin=1,
n.chains=3,
model.file="model_unm_conf_lash.txt",
debug=FALSE,
DIC=FALSE)
McCandless et al. (2007).2 also proposed methods for dealing with an unmeasured confounder when no EMM is present. The advantage of their method is that it can be extended to a scenario where the confounder is also an EMM. In their approach, the unmeasured confounder (UC) is considered a latent variable. A latent variable is a variable that cannot be measured (i.e., it is not in the data set), but that will nonetheless be included in the model. In this case, it enters the model as follows:
\(Disease \sim dbern(P)\)
\(logit(P) = β_0 + β_1*exposure + β_2*UC\)
Of course, since the latent variable UC is not in the data set, we will need to link it to other parameters. First, we explicitly link the value (0 or 1) of the unmeasured confounder to the exposure status as follows:
\(UC \sim dbern(Pu)\)
\(logit(Pu) = γ_0 + γ_1*exposure\)
So the value of the unmeasured confounder can be 0 or 1, and it has probability \(Pu\) of being 1. This probability depends on the exposure status, which makes sense since a confounder has to be associated with the disease of interest, but also with the exposure.
Then, we will make use of the following concepts:
- The hypothesized OR representing the association between the unmeasured confounder and the disease (\(OR(ud)\)) can be computed as the exponential of the \(β_2\) coefficient in the first equation (\(OR(ud)=exp(β_2)\));
- The prevalence of the unmeasured confounder among the exposed (\(PU(exp)\)) can be estimated using the second equation, through the inverse logit function: \(PU(exp) = 1 / (1+exp(-1*(γ_0 + γ_1)))\);
- Similarly, the prevalence of the unmeasured confounder among the unexposed (\(PU(unexp)\)) can be estimated as \(PU(unexp) = 1 / (1+exp(-1*γ_0))\).
For modelling purposes, it may be more convenient to isolate \(γ_1\) and \(γ_0\) in the latter two equations, as follows:
\(γ_1 = -log(1/PU(exp)-1)-γ_0\)
\(γ_0 = -log(1/PU(unexp)-1)\)
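These two expressions are simply rearrangements of the inverse logit. As a sanity check, the base R sketch below (with hypothetical prevalences, used only for illustration) plugs them back through the inverse logit and recovers the prevalences we started from:

```r
# Hypothetical prevalences of the confounder (assumptions for the sketch)
PU_exp   <- 0.60
PU_unexp <- 0.20

gamma0 <- -log(1 / PU_unexp - 1)         # gamma_0 from PU(unexp)
gamma1 <- -log(1 / PU_exp - 1) - gamma0  # gamma_1 from PU(exp)

inv_logit <- function(x) 1 / (1 + exp(-x))
inv_logit(gamma0)           # recovers PU(unexp) = 0.20
inv_logit(gamma0 + gamma1)  # recovers PU(exp) = 0.60
```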
With that, we can create a text file with our model, using the exact same unknown parameters as Lash’s method (a coefficient for the intercept \(β_0\), a coefficient \(β_1\) representing the adjusted effect of the exposure, \(OR(ud)\), \(PU(exp)\), and \(PU(unexp)\)). Finally, using the coefficient \(β_1\) representing the adjusted effect of the exposure, we will be able to compute the adjusted OR (\(OR(adj)\)). Here is how we could create this model; the resulting .txt file is illustrated below.
# Specify the model:
model_unm_conf_McCandless <- paste0("model{
#=== LIKELIHOOD ===#
for( i in 1 : ",num_obs," ) {
test[i] ~ dbern(p[i])
logit(p[i]) <- int + betaquest*quest[i] + betaUC*UC[i] #UC IS AN UNMEASURED CONFOUNDER (LATENT)
UC[i] ~ dbern(Pu[i]) #THE VALUE OF UC HAS PROBABILITY Pu OF BEING 1
logit(Pu[i]) <- gamma + gammaquest*quest[i] #Pu VARIES AS FUNCTION OF THE EXPOSURE STATUS
}
gamma <- log(1/PU_unexp - 1)*-1 #LINK BETWEEN gamma and PU_unexp
gammaquest <- (log(1/PU_exp - 1)*-1)-gamma #LINK BETWEEN gammaquest AND PU_exp
#=== PRIORS ===#
int ~ dnorm(",mu_int,", ",inv_var_int,") #FLAT PRIOR
betaquest ~ dnorm(",mu_betaquest,", ",inv_var_betaquest,") #FLAT PRIOR
betaUC ~ dnorm(", mu_log_OR_ud,"," , inv_var_log_OR_ud,") #PRIOR FOR betaUC WHICH IS SIMPLY OR_ud, BUT ON THE LOG SCALE
PU_exp ~ dbeta(", rval1$shape1,",", rval1$shape2, ") #PRIOR FOR PREVALENCE OF THE CONFOUNDER IN THE EXPOSED INDIVIDUALS
PU_unexp ~ dbeta(", rval2$shape1,",", rval2$shape2, ") #PRIOR FOR PREVALENCE OF THE CONFOUNDER IN THE UNEXPOSED INDIVIDUALS
#=== EXTRA CALCULATION TO GET OR(adj) ===#
ORquest<-exp(betaquest)
}")
#write to temporary text file
write.table(model_unm_conf_McCandless, file="model_unm_conf_McCandless.txt", quote=FALSE, sep="", row.names=FALSE,
col.names=FALSE)
Figure. Model unmeasured confounder (McCandless’ adjustment).
NOTE: The McCandless model takes much more time to run than the ones where we simply apply Lash’s adjustment. Don’t be surprised if OpenBUGS takes a long time to spit out the results. Moreover, in my experience, I see more problems (mainly high autocorrelation of the Markov chains) with the McCandless model than with the Lash-derived model. So inspect your Markov chains carefully. You may have to run the chains for more iterations to get a sufficient effective sample size. You may also have to use informative priors for all your parameters.
Folder: Exercise 2 - Unmeasured confounder
Open up the R project for Exercise 2 (i.e., the R project file named Exercise 2 - Unmeasured confounder.Rproj).
We will re-investigate exercise #4 from the QBA part of the course on adjusting for an unmeasured confounder, but using a Bayesian analysis. This latter exercise was described as follows:
In this exercise, we will assume that pig ownership may have been a confounder but, for some inexplicable reason, was not measured or not reported. From other literature / data sources you come up with estimates of the following bias parameters:
- There is (for obvious reasons) a strong association between pig ownership and risk of cysticercosis – you estimate the RR (or OR) to be approximately 5.
- There is less information about the prevalence of pig ownership among individuals who know (exposed) or don’t know (unexposed) about cysticercosis. You run a small pilot study and come up with p(pig|know=0) = 20% and p(pig|know=1) = 60%.
In the Question 1 folder, you will find a partially completed R script named Question 1.R. To answer the following questions, try to complete the missing parts (they are highlighted by a #TO COMPLETE# comment). Again, we also provided complete R scripts, but try to work it out on your own first!
Start from the partially completed Question 1.R R script located in the Question 1 folder. Try to adjust the OR obtained from a conventional logistic model using Lash’s method. Use vague priors for the intercept and the coefficient of your logistic model. Assume that you are not 100% sure about the values that should be used for your bias parameters, so use distributions for these rather than exact values. How do these results compare to the one obtained with the QBA analyses?
Start from the R script that you completed in Question 1. To investigate the effect of imprecision in your bias parameters, change the normal distribution for the log of the unmeasured confounder-disease OR. Keep a mean of 1.6, but use a larger variance, for instance 1.0 (i.e., an inverse variance of 1.0). How is your credible interval affected?
Use the Question 3 completed.R R script located in the Question 3 folder. This script is already complete; run it piece by piece to see how it works and to re-investigate this bias analysis using the alternative model proposed by McCandless. Currently the script makes use of the same bias parameter distributions proposed in Question 1. Do you get similar results? You could also try the bias parameters from Question 2. Note that the McCandless model could easily be expanded to add effect measure modification (i.e., an interaction term between exposure and confounder). Using that expanded model, you could then report the effect of exposure by strata of the confounder.
Selection bias arises when a measure of association between an exposure and a disease is wrongly conditioned on a third variable; for instance, if we were to condition on the S variable in the figure below.
Figure. DAG representation of selection bias by a variable S.
Remember, we can condition on a variable:
1. by selecting study subjects that all have the same value for the variable (exclusion);
2. by matching subjects based on this value (matching);
3. by including the variable in our model in multiple regression (analytical control).
In the DAG above, we see that conditioning on S opens a path that was otherwise closed (S is a collider between exposure and disease).
Of course, if the selection bias was introduced by analytical control, well… just remove the unneeded variable from your model! Otherwise, to adjust for a selection bias due to the study design, we can use a conventional logistic model in a Bayesian framework, incorporating the adjustment for selection bias described by Lash et al. (2009) within the estimation process.
With:
- \(S(d+e+)\) being the probability of being selected if diseased and exposed;
- \(S(d+e-)\) being the probability of being selected if diseased and unexposed;
- \(S(d-e-)\) being the probability of being selected if healthy and unexposed;
- \(S(d-e+)\) being the probability of being selected if healthy and exposed;
We can compute what is called the selection OR (\(OR(sel)\)) as follows:
\(OR(sel) = \frac{S(d+e-)*S(d-e+)}{S(d+e+)*S(d-e-)}\)
Then, we can compute an adjusted OR (\(OR(adj)\)) simply by multiplying the biased OR (\(OR(biased)\)) by the \(OR(sel)\). This \(OR(adj)\) now describes the relation between exposure and disease after adjusting for the selection bias.
\(OR(adj) = OR(biased) * OR(sel)\)
or
\(OR(adj) = OR(biased) * \frac{S(d+e-)*S(d-e+)}{S(d+e+)*S(d-e-)}\)
We could thus, simply:
1. Provide prior distributions for the intercept and the coefficient of interest, but also for the 4 selection bias parameters (\(S(d+e-)\), \(S(d-e+)\), \(S(d+e+)\), and \(S(d-e-)\)).
NOTE THAT: instead of providing prior distributions for the bias parameters, you could also specify an exact value for each (e.g., \(S(d+e-)=0.15\)). The adjustment would then be deterministic, rather than stochastic, for the bias parameters;
2. Develop a conventional logistic Bayesian model;
3. Include the above equation to compute the \(OR(adj)\).
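As with the unmeasured confounder, the adjustment can first be checked deterministically in base R. The selection probabilities and biased OR below are arbitrary illustrative values, not the course's bias parameters:

```r
# Deterministic selection-bias adjustment (illustrative values only)
S_dis_exp     <- 0.80  # S(d+e+): selected if diseased and exposed
S_dis_unexp   <- 0.20  # S(d+e-): selected if diseased and unexposed
S_undis_exp   <- 0.30  # S(d-e+): selected if healthy and exposed
S_undis_unexp <- 0.10  # S(d-e-): selected if healthy and unexposed
OR_biased     <- 0.47  # hypothetical biased OR

OR_sel <- (S_dis_unexp * S_undis_exp) / (S_dis_exp * S_undis_unexp)
OR_adj <- OR_biased * OR_sel
c(OR_sel = OR_sel, OR_adj = OR_adj)
```

Notice that when the product of the "concordant" selection probabilities equals the product of the "discordant" ones, \(OR(sel)\) is 1 and no adjustment occurs.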
If we use uniform vague distributions for the intercept and the coefficient of interest:
options(scipen=999) # To avoid use of scientific notation by R
mu_int <- 0.0 #mu for the intercept
inv_var_int <- 0.0001 #Inverse variance for the intercept
mu_betaquest <- 0.0 #mu for the coefficient
inv_var_betaquest <- 0.0001 #Inverse variance for the coefficient
We could use beta distributions for the selection bias parameters, since these are all probabilities (i.e., we would like to bound them between 0.0 and 1.0). Here, we will need to use the epi.betabuster() function from the epiR library to identify the a and b shape parameters for each distribution.
For instance, if we believe that:
- \(S(d+e+)\) has a mode of 0.75 and we are 95% certain that it is above 0.65;
- \(S(d+e-)\) has a mode of 0.25 and we are 95% certain that it is below 0.35;
- \(S(d-e+)\) has a mode of 0.30 and we are 95% certain that it is below 0.40;
- \(S(d-e-)\) has a mode of 0.10 and we are 95% certain that it is below 0.20;
We could find the 4 corresponding beta distributions as follows:
library(epiR)
#S(Dis+ Exp+)
rval.dis.exp <- epi.betabuster(mode=0.75, conf=0.95, greaterthan=T, x=0.65)
rval.dis.exp$shape1
## [1] 48.603
rval.dis.exp$shape2
## [1] 16.86767
#plot the prior distribution
curve(dbeta(x, shape1=rval.dis.exp$shape1, shape2=rval.dis.exp$shape2), from=0, to=1,
main="Prior for S(d+e+)", xlab = "Proportion", ylab = "Density")
#S(Dis+ Exp-)
rval.dis.unexp <- epi.betabuster(mode=0.25, conf=0.95, greaterthan=F, x=0.35)
rval.dis.unexp$shape1
## [1] 16.868
rval.dis.unexp$shape2
## [1] 48.604
#plot the prior distribution
curve(dbeta(x, shape1=rval.dis.unexp$shape1, shape2=rval.dis.unexp$shape2), from=0, to=1,
main="Prior for S(d+e-)", xlab = "Proportion", ylab = "Density")
#S(Dis- Exp+)
rval.undis.exp <- epi.betabuster(mode=0.30, conf=0.95, greaterthan=F, x=0.40)
rval.undis.exp$shape1
## [1] 20.939
rval.undis.exp$shape2
## [1] 47.52433
#plot the prior distribution
curve(dbeta(x, shape1=rval.undis.exp$shape1, shape2=rval.undis.exp$shape2), from=0, to=1,
main="Prior for S(d-e+)", xlab = "Proportion", ylab = "Density")
#S(Dis- Exp-)
rval.undis.unexp <- epi.betabuster(mode=0.10, conf=0.95, greaterthan=F, x=0.20)
rval.undis.unexp$shape1
## [1] 5.619
rval.undis.unexp$shape2
## [1] 42.571
#plot the prior distribution
curve(dbeta(x, shape1=rval.undis.unexp$shape1, shape2=rval.undis.unexp$shape2), from=0, to=1,
main="Prior for S(d-e-)", xlab = "Proportion", ylab = "Density")
With these priors being determined, the following code would create the text file for the model:
# Specify the model:
model_sel <- paste0("model{
#=== LIKELIHOOD ===#
for( i in 1 : ",num_obs," ) {
test[i] ~ dbern(p[i])
logit(p[i]) <- int + betaquest*quest[i]
}
#=== EXTRA CALCULATION TO COMPUTE OR_sel ===#
OR_sel <- (S_dis_unexp*S_undis_exp)/(S_dis_exp*S_undis_unexp)
#=== EXTRA CALCULATION TO GET OR_biased and OR_adj ===#
OR_biased <- exp(betaquest)
OR_adj <- OR_biased*OR_sel
#=== PRIORS ===#
int ~ dnorm(",mu_int,", ",inv_var_int,") #FLAT PRIOR
betaquest ~ dnorm(",mu_betaquest,", ",inv_var_betaquest,") #FLAT PRIOR
S_dis_exp ~ dbeta(", rval.dis.exp$shape1,",",rval.dis.exp$shape2, ") #PRIOR FOR SELECTION PROBABILITY IN D+E+ INDIVIDUALS
S_dis_unexp ~ dbeta(", rval.dis.unexp$shape1,",", rval.dis.unexp$shape2, ") #PRIOR FOR SELECTION PROBABILITY IN D+E- INDIVIDUALS
S_undis_exp ~ dbeta(", rval.undis.exp$shape1,",", rval.undis.exp$shape2, ") #PRIOR FOR SELECTION PROBABILITY IN D-E+ INDIVIDUALS
S_undis_unexp ~ dbeta(", rval.undis.unexp$shape1,",", rval.undis.unexp$shape2, ") #PRIOR FOR SELECTION PROBABILITY IN D-E- INDIVIDUALS
}")
#write to temporary text file
write.table(model_sel, file="model_sel.txt", quote=FALSE, sep="", row.names=FALSE,
col.names=FALSE)
This code would create the following .txt file:
Again, if we want to provide initial values, we have to provide them for all the unknown parameters (including the selection probabilities): not just int and betaquest, but also \(S(d+e-)\), \(S(d-e+)\), \(S(d+e+)\), and \(S(d-e-)\). For instance:
inits <- list(
list(int=0.0, betaquest=0.0, S_dis_exp=0.20, S_dis_unexp=0.20, S_undis_exp=0.20, S_undis_unexp=0.20),
list(int=1.0, betaquest=1.0, S_dis_exp=0.30, S_dis_unexp=0.30, S_undis_exp=0.30, S_undis_unexp=0.30),
list(int=-1.0, betaquest=-1.0, S_dis_exp=0.40, S_dis_unexp=0.40, S_undis_exp=0.40, S_undis_unexp=0.40)
)
When running the Bayesian analysis, you will again have to indicate the parameters that you want to monitor:
library(R2OpenBUGS)
#Set number of iterations
niterations <- 5000
#Run the Bayesian model
bug.out.sel <- bugs(data=cyst,
inits=inits,
parameters.to.save=c("int", "betaquest", "OR_sel", "OR_biased", "OR_adj"), ### HERE I HAVE ADDED OR_sel, OR_biased, and OR_adj ###
n.iter=niterations,
n.burnin=0,
n.thin=1,
n.chains=3,
model.file="model_sel.txt",
debug=FALSE,
DIC=FALSE)
Folder: Exercise 3 - Selection bias
Open up the R project for Exercise 3 (i.e., the R project file named Exercise 3 - Selection bias.Rproj).
In the Question 1 folder, you will find a partially completed R script named Question 1.R. To answer the following questions, try to complete the missing parts (they are highlighted by a #TO COMPLETE# comment). Again, we also provided complete R scripts, but try to work it out on your own first!
We will re-investigate exercise #5 from the QBA part of the course on adjusting for selection bias, but using a Bayesian analysis.
Exercise #5 was described as follows:
Based on some follow-up investigation of non-participants in the study, you estimate the probability of participation in individuals that neither have cysticercosis, nor know anything about it to be about 10%. You expect knowledge of cysticercosis to increase this by a factor of 3x. Actually having cysticercosis (whether they know it or not) increased participation by a factor of 2.5x. The two factors act independently on the probability of participation.
Selection bias parameters were thus:
- Selection probability among diseased exposed: 0.75
- Selection probability among diseased unexposed: 0.25
- Selection probability among healthy exposed: 0.30
- Selection probability among healthy unexposed: 0.10
Start from the partially completed Question 1.R R script located in the Question 1 folder. Try to adjust the OR obtained from a conventional logistic model using the equation provided during the lecture on selection bias. Use vague priors for the intercept and the coefficient of your logistic model. As a first step, use a deterministic approach for the bias parameters (i.e., exact values rather than prior distributions). How do these results compare to those of exercise #5? What is going on?
Modify the R script that you completed for Question 1. Now assume that the selection probability among the diseased unexposed is also 0.75. For now, stick with a deterministic approach (i.e., exact values rather than prior distributions) for the bias parameters. Is there a selection bias in that case?
Modify the R script that you completed for Question 2. Use the same bias parameters, but now propose distributions for them. For instance, you could use the proposed selection probabilities as modes of the distributions and fix the 5th or 95th percentiles at -10 or +10 percentage points from those modes. How do your results differ from those of Question 2?
Modify the R script that you have completed for Question 3. If you have the time, repeat the analyses from Question 3, but with 5th or 95th percentiles at -20 or +20 percentage-points from the modes.
When a binary outcome is subject to error, a misclassification bias can be introduced. The misclassification is described as non-differential when outcome misclassification is associated neither with the exposure status nor with other covariates. It is described as differential when misclassification of the health outcome differs between exposed and unexposed subjects, or when misclassification is associated with other covariates. Both cases can be handled with a bias analysis.
McInturff et al. (2004)3 have proposed using a latent class model to handle non-differential outcome misclassification. In their model, the effect of an exposure on the true probability of disease (\(P(true)\)) is modelled using a logistic regression model where the disease is a latent class variable (i.e., an unmeasured discrete variable). This is the “response” part of their model. Then, they linked \(P(true)\) to the probability of observing the outcome (\(P(observed)\)) using the test sensitivity (\(Se\)) and specificity (\(Sp\)). This is the “measurement error” part of their model. The model can be written as follows:
\(Yobserved \sim dbern(P(observed))\)
\(P(observed) = Se*P(true) + (1-Sp)*(1-P(true))\)
\(logit(P(true)) = β_0 + β_1*quest\)
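To see what the measurement-error part does numerically, the short base R sketch below maps the true disease probability to the observed one for unexposed and exposed individuals. The \(Se\), \(Sp\), and regression coefficients are assumed values, chosen only for illustration:

```r
# Hypothetical values for illustration only
Se <- 0.90
Sp <- 0.95
b0 <- -1.4  # assumed intercept
b1 <- -0.8  # assumed quest coefficient
quest <- c(0, 1)

P_true <- 1 / (1 + exp(-(b0 + b1 * quest)))      # response part (inverse logit)
P_obs  <- Se * P_true + (1 - Sp) * (1 - P_true)  # measurement error part
round(cbind(quest, P_true, P_obs), 3)
```

Note that \(P(observed)\) is always pulled into the interval \([1-Sp, Se]\): with an imperfect test, very low true probabilities are observed as somewhat higher ones (false positives), and very high ones as somewhat lower (false negatives).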
To run McInturff’s model for non-differential misclassification in R, you will have to:
1. Provide prior distributions (or exact values) for the test’s Se and Sp. Beta distributions could be used, since these are probabilities;
2. Modify the R script used to generate the model’s .txt file. For instance, the following script would generate such a .txt file:
# Specify the model:
model_mis_non_dif <- paste0("model{
#=== LIKELIHOOD ===#
for( i in 1 : ",num_obs," ) {
test[i] ~ dbern(P_obs[i])
P_obs[i] <- P_true[i]*Se+(1-P_true[i])*(1-Sp) #MEASUREMENT ERROR PART
logit(P_true[i]) <- int + betaquest*quest[i] #RESPONSE PART
}
#=== EXTRA CALCULATION TO GET OR ===#
ORquest <- exp(betaquest)
#=== PRIOR ===#
int ~ dnorm(",mu_int,", ",inv_var_int,") #FLAT PRIOR
betaquest ~ dnorm(",mu_betaquest,", ",inv_var_betaquest,") #FLAT PRIOR
Se ~ dbeta(", rval.se$shape1,",",rval.se$shape2, ") #PRIOR FOR SENSITIVITY
Sp ~ dbeta(", rval.sp$shape1,",",rval.sp$shape2, ") #PRIOR FOR SPECIFICITY
}")
#write to temporary text file
write.table(model_mis_non_dif, file="model_mis_non_dif.txt", quote=FALSE, sep="", row.names=FALSE,
col.names=FALSE)
The generated .txt file is illustrated below:
If you prefer to assign fixed values (rather than distributions) for the test’s \(Se\) and \(Sp\), you could modify the last two lines of the R script.
McInturff’s model can be extended to handle differential outcome misclassification. For that, we have to allow the test’s \(Se\) and \(Sp\) to vary according to the value of another variable. We can make the test accuracy vary as a function of the exposure, of another covariate in the model (e.g., a measured confounder), or even of a characteristic that is not in the model at all.
For instance, if we allow the test’s \(Se\) and \(Sp\) to vary according to the value of the exposure, we could rewrite the measurement error part of the model using two equations:
In unexposed: \(P(observed) = Se_1*P(true) + (1-Sp_1)*(1-P(true))\)
In exposed: \(P(observed) = Se_2*P(true) + (1-Sp_2)*(1-P(true))\)
Thus, in unexposed individuals we will use \(Se_1\) and \(Sp_1\) to link \(P(observed)\) to \(P(true)\), while in exposed individuals we will use \(Se_2\) and \(Sp_2\). We will, therefore, have to:
1. Specify exact values or prior distributions for \(Se_1\), \(Sp_1\), \(Se_2\), and \(Sp_2\);
2. Indicate in the model’s .txt file whether, for a given individual, we need to use \(Se_1\) and \(Sp_1\) vs. \(Se_2\) and \(Sp_2\).
For the latter, we will simply create a variable that takes value 1 in unexposed individuals (i.e., when quest=0) and value 2 in exposed individuals (i.e., when quest=1). Then, we will use this variable to index the \(Se\) and \(Sp\) in our code. We can achieve this in OpenBUGS as follows.
This line assigns the value 1 or 2 to unexposed and exposed individuals, respectively.
diff[i] <- quest[i]+1
We could then modify the measurement error part as follows, to assign \(Se_1\) and \(Sp_1\) vs. \(Se_2\) and \(Sp_2\) to a given individual as a function of their exposure status:
P_obs[i] <- P_true[i] * Se[diff[i]]+(1-P_true[i]) * (1-Sp[diff[i]])
So the complete text file could be generated as follows (after having provided prior distributions or fixed values for \(Se_1\), \(Sp_1\), \(Se_2\), and \(Sp_2\), of course).
# Specify the model:
model_mis_dif <- paste0("model{
#=== LIKELIHOOD ===#
for( i in 1 : ",num_obs," ) {
diff[i] <- quest[i]+1
test[i] ~ dbern(P_obs[i])
P_obs[i] <- P_true[i]*Se[diff[i]]+(1-P_true[i])*(1-Sp[diff[i]])
logit(P_true[i]) <- int + betaquest*quest[i] #RESPONSE PART
}
#=== EXTRA CALCULATION TO GET OR ===#
ORquest <- exp(betaquest)
#=== PRIOR ===#
int ~ dnorm(",mu_int,", ",inv_var_int,") #FLAT PRIOR
betaquest ~ dnorm(",mu_betaquest,", ",inv_var_betaquest,") #FLAT PRIOR
Se[1] ~ dbeta(", rval.se1$shape1,",",rval.se1$shape2, ") #SENSITIVITY UNEXPOSED
Sp[1] ~ dbeta(", rval.sp1$shape1,",",rval.sp1$shape2, ") #SPECIFICITY UNEXPOSED
Se[2] ~ dbeta(", rval.se2$shape1,",",rval.se2$shape2, ") #SENSITIVITY EXPOSED
Sp[2] ~ dbeta(", rval.sp2$shape1,",",rval.sp2$shape2, ") #SPECIFICITY EXPOSED
}")
#write to temporary text file
write.table(model_mis_dif, file="model_mis_dif.txt", quote=FALSE, sep="", row.names=FALSE,
col.names=FALSE)
The text file would look as follows:
If budget allows, one could decide to use a less than perfect test to measure the outcome on all individuals, but also to apply a gold-standard test on a subset of the individuals. The relationship between the imperfect and gold-standard results could be used to generate fixed values or prior distributions for the \(Se\) and \(Sp\) that can then directly be used in McInturff’s latent class model. For instance, we could compile the number of test+, gold standard+, test-, and gold standard- individuals. Using this, we could describe the relation between test+, gold standard+ and test’s \(Se\) as:
\(test+ \sim dbin(Se, goldstandard+)\)
Or, if you prefer: the number of test+ depends on the test’s \(Se\) and the number of true positives. And, for the test’s \(Sp\):
\(test- \sim dbin(Sp, goldstandard-)\)
Vague priors could be used on the test’s \(Se\) and \(Sp\).
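Because a Beta prior is conjugate to these binomial likelihoods, the update has a closed form: with a vague Beta(1,1) prior, observing s test+ among n gold-standard+ individuals gives a Beta(s+1, n-s+1) posterior for \(Se\) (and similarly for \(Sp\)). A small sketch, written in Python only to illustrate the arithmetic; the counts are hypothetical:

```python
def beta_posterior(successes, trials, prior_a=1.0, prior_b=1.0):
    """Beta prior + binomial likelihood -> Beta posterior (conjugate update)."""
    a = prior_a + successes
    b = prior_b + (trials - successes)
    return a, b, a / (a + b)  # shape1, shape2, posterior mean

# e.g., 45 test+ among 50 gold-standard+ individuals (hypothetical counts)
a_se, b_se, mean_se = beta_posterior(45, 50)

# and 90 test- among 95 gold-standard- individuals, for Sp (hypothetical counts)
a_sp, b_sp, mean_sp = beta_posterior(90, 95)
```

In the full latent class model, OpenBUGS performs this update automatically when the validation substudy is included in the likelihood; the sketch just shows why a modest substudy can yield fairly sharp \(Se\) and \(Sp\) priors.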
Folder: Exercise 4 - Outcome misclassification
Open up the R project for Exercise 4 (Exercise 4 - Outcome misclassification.Rproj).
In the Question 1 folder, you will find a partially completed R script named Question 1.R. To answer the following questions, try to complete the missing parts (they are highlighted by a #TO COMPLETE# comment). Again, we also provided complete R scripts, but try to work it out on your own first!
We will re-investigate exercise #5 from the QBA part of the course on adjusting for misclassification of outcome, but using a Bayesian analysis.
Exercise #5 was described as follows:
It is likely that you will be able to find estimates of the Se and Sp of the test for cysticercosis in the literature. Let’s assume that you found Se=90% and Sp=97%.
Start from the partially completed Question 1.R R script located in the Question 1 folder. Develop a latent class model that would provide an OR adjusted for outcome misclassification. As a first analysis, use point estimates for the Se and the Sp (i.e., use a deterministic approach). Try the value 1.0 for Se and Sp. Thus, you would now have a model that explicitly assumes that the Se and Sp of the diagnostic test are perfect. How do these results differ from those of the conventional logistic model?
Modify the R script that you have completed for Question 1. Stick with the deterministic approach, but now use Se=0.90 and Sp=0.97. Are the results similar to those of exercise #5 from the QBA part of the course?
Modify the R script that you have completed for Question 2. Now use a probabilistic approach where distributions are used to describe your knowledge about the Se and Sp of the diagnostic test. For instance, in the literature you may have found that, for the Se, a distribution with mode=0.90 and 5th percentile=0.80 would be realistic. Similarly, the Sp could be described with a distribution with mode=0.97 and 5th percentile=0.90. What is your prediction (a priori) regarding your results? You may have to use informative priors for your unknown parameters. You will possibly need a long burn-in period and many iterations, etc.
Start from the partially completed Question 4.R R script located in Question 4 folder. Using the same data, develop a model that would allow for differential outcome misclassification (i.e., different Se and Sp values in exposed and unexposed). First, start with a deterministic approach. For instance, use Se=0.85 and Sp=0.99 in unexposed and Se=0.95 and Sp=0.95 in exposed.
Modify the R script that you have completed for Question 4. Now use distributions for your bias parameters. Use the previously proposed values as modes for the distributions, and mode minus 5 percentage-points as 5th percentiles. You will probably have to use informative priors for the quest coefficient. You will possibly need a long burn-in period and many iterations, etc.
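These questions ask for Beta priors specified through a mode and a 5th percentile. In R this conversion is typically done with something like epiR::epi.betabuster, which is presumably how the rval.se$shape1/rval.se$shape2 objects in the earlier code were created. The conversion itself is just a one-dimensional search, sketched here in Python (a rough stand-in, with a coarse step size, not the exact epi.betabuster algorithm):

```python
import math

def beta_pdf(x, a, b):
    # Density of Beta(a, b) for x in (0, 1)
    log_norm = math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)
    return math.exp((a - 1) * math.log(x) + (b - 1) * math.log(1 - x) - log_norm)

def beta_cdf(x, a, b, n=400):
    # P(X <= x) via Simpson's rule (n even); fine for a, b > 1
    if x <= 0:
        return 0.0
    h = x / n
    total = beta_pdf(1e-12, a, b) + beta_pdf(x, a, b)
    for i in range(1, n):
        total += (4 if i % 2 else 2) * beta_pdf(i * h, a, b)
    return total * h / 3

def beta_shapes(mode, p5):
    # Keep the mode fixed and grow the concentration t = shape1 + shape2
    # until only 5% of the mass lies below the requested 5th percentile.
    t = 2.1
    while True:
        a = 1 + mode * (t - 2)        # mode = (a-1)/(a+b-2) by construction
        b = 1 + (1 - mode) * (t - 2)
        if beta_cdf(p5, a, b) <= 0.05:
            return a, b
        t += 0.1

# Se prior from Question 3: mode 0.90, 5th percentile 0.80
shape1, shape2 = beta_shapes(0.90, 0.80)
```

The resulting shape1/shape2 pair can then be dropped into the dbeta() lines of the model’s text file, exactly as the rval.se and rval.sp objects are used in the scripts above.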
Similar to outcome misclassification, when a binary exposure is subject to error, this can introduce a misclassification bias. Again, the misclassification bias will be described as non-differential when exposure misclassification is not associated with the outcome or other covariates. It will be described as differential when misclassification of the exposure is associated with the outcome or with other covariates. Both cases can be handled with a bias analysis.
Gustafson et al. (2003)4 proposed a latent class model to handle non-differential misclassification of a binary exposure. In this model, the effect of the true unknown exposure status (a latent class variable) of an individual i, \(Expo(true)_i\), on the probability of disease of this individual, \(P_i\), is modelled using a logistic regression. This is the “response” part of the model.
Response part:
\(Y(observed)_i \sim dbern(P_i)\)
\(logit(P_i) = β_0 + β_1*Expo(true)_i\)
Then, the model links the unknown true exposure status of individual i, \(Expo(true)_i\), and, therefore, the unknown true exposure prevalence \(P(expo)\), to the probability of observing the exposure in individual i, \(P(observed.expo)_i\), using the sensitivity (\(Se\)) and specificity (\(Sp\)) of the test used to measure the exposure and the observed exposure status of individual i, \(Expo(observed)_i\). This is the “measurement error” part of the model.
Measurement error part:
\(Expo(true)_i \sim dbern(P(expo))\)
\(Expo(observed)_i \sim dbern(P(observed.expo)_i)\)
\(P(observed.expo)_i = Se*Expo(true)_i + (1-Sp)*(1-Expo(true)_i)\)
So, with this model, if the true exposure status is exposed (exposure=1), then the probability of being observed as exposed is simply the \(Se\) of the test used to measure the exposure (since the right part of the equation, \((1-1)\), yields a multiplication by zero). On the other hand, if the true exposure status is unexposed (exposure=0), then the probability of being observed as exposed is \(1-Sp\) (since the left part of the equation is now multiplied by zero).
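This collapse is easy to verify numerically: plugging Expo(true)=1 into the measurement-error equation leaves only \(Se\), and Expo(true)=0 leaves only \(1-Sp\). A quick check in Python, with arbitrary illustrative \(Se\) and \(Sp\) values:

```python
def p_observed_exposed(expo_true, se, sp):
    # Measurement-error equation of Gustafson's model
    return se * expo_true + (1 - sp) * (1 - expo_true)

# Truly exposed (expo_true=1): observed as exposed with probability Se
# Truly unexposed (expo_true=0): observed as exposed with probability 1 - Sp
```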
To run Gustafson’s model for non-differential-misclassification in R, you will have to:
1. Provide prior distributions (or exact values) for the test’s Se and Sp (Beta distributions could be used, since these are probabilities) and a prior distribution for the unknown true exposure prevalence \(P(expo)\), although the latter can be described using a vague distribution;
2. Modify the R script used to generate the model’s .txt file. For instance, the following script would generate such a .txt file:
# Specify the model:
model_exp_non_dif <- paste0("model{
#=== LIKELIHOOD ===#
for( i in 1 : ",num_obs," ) {
#RESPONSE PART
test[i] ~ dbern(P[i])
logit(P[i]) <- int + betaquest*expo_true[i]
#MEASUREMENT ERROR PART
expo_true[i] ~ dbern(Pexpo)
quest[i] ~ dbern(P_obs_expo[i])
P_obs_expo[i] <- Se*expo_true[i] + (1-Sp)*(1-expo_true[i])
}
#=== EXTRA CALCULATION TO GET OR ===#
ORquest <- exp(betaquest)
#=== PRIOR ===#
int ~ dnorm(",mu_int,", ",inv_var_int,") #PRIOR INTERCEPT
betaquest ~ dnorm(",mu_betaquest,", ",inv_var_betaquest,") #PRIOR COEFFICIENT
Pexpo ~ dbeta(1.0, 1.0) #FLAT PRIOR ON PREVALENCE OF TRUE EXPOSURE
Se ~ dbeta(", rval.se$shape1,",",rval.se$shape2, ") #PRIOR FOR SENSITIVITY
Sp ~ dbeta(", rval.sp$shape1,",",rval.sp$shape2, ") #PRIOR FOR SPECIFICITY
}")
#write to temporary text file
write.table(model_exp_non_dif, file="model_exp_non_dif.txt", quote=FALSE, sep="", row.names=FALSE,
col.names=FALSE)
The generated .txt file is illustrated below:
If you prefer to assign fixed values (rather than distributions) for the test’s \(Se\) and \(Sp\), you could modify the last two lines of the R script.
Gustafson’s model can be extended to handle differential misclassification of a binary exposure. As with the model for differential outcome misclassification, you will now have to allow the test’s \(Se\) and \(Sp\) to vary according to the value of another variable.
For instance, if we allow the test’s \(Se\) and \(Sp\) to vary according to the value of the outcome, we could rewrite the measurement error part of the model using two equations:
In healthy: \(P(observed.expo)_i = Se_1*Expo(true)_i + (1-Sp_1)*(1-Expo(true)_i)\)
In diseased: \(P(observed.expo)_i = Se_2*Expo(true)_i + (1-Sp_2)*(1-Expo(true)_i)\)
Thus, if an individual is healthy, we will use \(Se_1\) and \(Sp_1\) to link \(Expo(observed)_i\) to \(Expo(true)_i\). In diseased individuals, we will use \(Se_2\) and \(Sp_2\). We will, therefore, have to:
1. Specify exact values or prior distributions for \(Se_1\), \(Sp_1\), \(Se_2\), and \(Sp_2\);
2. Indicate in the model’s .txt file whether, for a given individual, we need to use \(Se_1\) and \(Sp_1\) vs. \(Se_2\) and \(Sp_2\).
So the complete text file could be generated as follows (after having provided prior distributions or fixed values for \(Se_1\), \(Sp_1\), \(Se_2\), and \(Sp_2\), of course).
# Specify the model:
model_exp_dif <- paste0("model{
#=== LIKELIHOOD ===#
for( i in 1 : ",num_obs," ) {
#RESPONSE PART
test[i] ~ dbern(P[i])
logit(P[i]) <- int + betaquest*expo_true[i]
#CREATE VARIABLE INDICATING ACCURACY PARAMETERS TO PICK AS FUNCTION OF OUTCOME TEST
diff[i] <- test[i]+1
#MEASUREMENT ERROR PART
expo_true[i] ~ dbern(Pexpo)
quest[i] ~ dbern(P_obs_expo[i])
P_obs_expo[i] <- Se[diff[i]]*expo_true[i] + (1-Sp[diff[i]])*(1-expo_true[i])
}
#=== EXTRA CALCULATION TO GET OR ===#
ORquest <- exp(betaquest)
#=== PRIOR ===#
int ~ dnorm(",mu_int,", ",inv_var_int,") #PRIOR INTERCEPT
betaquest ~ dnorm(",mu_betaquest,", ",inv_var_betaquest,") #PRIOR COEFFICIENT
Pexpo ~ dbeta(1.0, 1.0) #FLAT PRIOR ON PREVALENCE OF TRUE EXPOSURE
Se[1] ~ dbeta(", rval.se1$shape1,",",rval.se1$shape2, ") #SENSITIVITY HEALTHY
Sp[1] ~ dbeta(", rval.sp1$shape1,",",rval.sp1$shape2, ") #SPECIFICITY HEALTHY
Se[2] ~ dbeta(", rval.se2$shape1,",",rval.se2$shape2, ") #SENSITIVITY DISEASED
Sp[2] ~ dbeta(", rval.sp2$shape1,",",rval.sp2$shape2, ") #SPECIFICITY DISEASED
}")
#write to temporary text file
write.table(model_exp_dif, file="model_exp_dif.txt", quote=FALSE, sep="", row.names=FALSE,
col.names=FALSE)
The text file would look as follows:
Folder: Exercise 5 - Exposure misclassification
Open up the R project for Exercise 5 (Exercise 5 - Exposure misclassification.Rproj).
In the Question 1 folder, you will find a partially completed R script named Question 1.R. To answer the following questions, try to complete the missing parts (they are highlighted by a #TO COMPLETE# comment). Again, we also provided complete R scripts, but try to work it out on your own first!
We will re-investigate exercise #5 from the QBA part of the course on adjusting for misclassification of exposure, but using a Bayesian analysis. That part was described as follows:
It is probably very difficult to know what the Se and Sp are for the questionnaire (quest) in terms of predicting the true knowledge status of the individual. Your best estimates based on focus groups held during the development of the questionnaire are Se=0.8 and Sp=0.9.
The bias parameters are:
- Sensitivity of exposure classification: 0.80
- Specificity of exposure classification: 0.90
Start from the partially completed Question 1.R R script located in the Question 1 folder. Using a Bayesian model, compute an OR adjusted for exposure misclassification. First use a deterministic approach (using the bias parameter values proposed).
Modify the R script that you have completed for Question 1. Now use distributions for the bias parameters, using the proposed values as modes, and mode plus or minus 5 percentage-points for the 95th or 5th percentile.
Modify the R script that you have completed for Question 1. We will now explore the potential impact of differential misclassification of quest. Suppose that the Se of the test to measure knowledge about cysticercosis is exactly (deterministic approach) 0.90 among test+ individuals and 0.70 among test-, and that the Sp is 0.95 among test+ and 0.85 among test-. Is the misclassification bias larger or smaller than in #1?
If you have the time, modify the R script that you have completed for Question 3. Now, include some uncertainty around the estimates provided in Question 3 by adding 5th percentiles located 10 percentage-points from the proposed value (e.g., mode at 0.90, 5th percentile at 0.80). What has changed?
Conducting independent adjustments for each bias provides information on the direction and magnitude of each bias, but it does not provide a final bias-adjusted point estimate with its 95CI. Some biases may cancel each other out, so presenting the selection bias-adjusted estimate, then the outcome misclassification-adjusted estimate, etc., will not provide a complete picture.
We can combine the concepts discussed in the previous sections to conduct a multiple biases analysis. Adjustments will be made sequentially. For instance, you could adjust your estimate for outcome misclassification, and then apply a subsequent adjustment for selection bias to the misclassification bias-adjusted estimate. You will then have a misclassification bias-adjusted, selection bias-adjusted estimate and its 95CI.
When conducting a multiple biases analysis, we need to control biases sequentially in reverse order of occurrence. The figure below illustrates the introduction of systematic and random errors in observational studies. When controlling multiple biases, we should apply control sequentially from right to left (i.e., misclassification, then selection, then confounding bias).
Figure. Generation of observed data from a study showing introduction of errors from values we want to know, to values observed.
Folder: Exercise 6 - Multiple biases
Open up the R project for Exercise 6 (Exercise 6 - Multiple biases.Rproj).
In the Question 1 folder, you will find a partially completed R script named Question 1.R. To answer the following questions, try to complete the missing parts (they are highlighted by a #TO COMPLETE# comment). Again, we also provided complete R scripts, but try to work it out on your own first!
We will re-investigate exercise #7 from the QBA part of the course on adjusting for multiple biases, but using a Bayesian analysis. This part was described as follows:
Using the observed study data we will adjust for unmeasured confounding, misclassification of exposure, and selection bias. Use the following information to set up your bias parameters.
Start from the partially completed Question 1.R R script located in the Question 1 folder. Using a Bayesian model, compute an OR adjusted for unmeasured confounding, misclassification of exposure, and selection bias. In this case, in which order will you have to organize your bias adjustments? First use a deterministic approach (using the bias parameter values proposed, but ignoring the provided 5th and 95th percentiles or any other measures of spread of the distributions).
Modify the R script that you have completed for Question 1. Now use a stochastic approach using distributions for all bias parameters.
Start from the partially completed Question 1.R R script located in the Question 1 folder. Use this script as a starting point to compute the crude association between quest and test using a logistic regression model estimated with a Bayesian analysis. Initially, use flat priors for your parameters. Make sure the Markov chains’ behaviours are acceptable (i.e., ESS, autocorrelation plots) and that convergence is obtained (i.e., trace plot). How do these results compare to the unadjusted \(OR\) computed using a frequentist approach (i.e., OR=0.47; 95%CI=0.37, 0.60)?

When we do a quick examination of the association between quest and test, we see an OR of 0.47 (95CI: 0.37, 0.60).
## Outcome + Outcome - Total Inc risk * Odds
## Exposed + 137 1142 1279 10.7 0.120
## Exposed - 180 702 882 20.4 0.256
## Total 317 1844 2161 14.7 0.172
##
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio 0.52 (0.43, 0.64)
## Odds ratio 0.47 (0.37, 0.60)
## Attrib risk * -9.70 (-12.85, -6.54)
## Attrib risk in population * -5.74 (-8.79, -2.69)
## Attrib fraction in exposed (%) -90.53 (-133.87, -55.21)
## Attrib fraction in population (%) -39.12 (-52.17, -27.19)
## -------------------------------------------------------------------
## Test that OR = 1: chi2(1) = 39.212 Pr>chi2 = <0.001
## Wald confidence limits
## CI: confidence interval
## * Outcomes per 100 population units
This OR and 95CI were estimated using a frequentist approach, though. Now, with the Bayesian model…
First, regarding MCMC diagnostics:
Now regarding the results:
## 50% 2.5% 97.5%
## int -1.362 -1.527 -1.201
## betaquest -0.761 -1.005 -0.519
## ORquest 0.467 0.366 0.595
We see that, when we used a Bayesian estimation with vague prior distributions, the estimated OR (95CI) is 0.47 (0.37, 0.60). Exactly that of the frequentist approach! This was expected, since vague priors were used for all parameters.
We could present the prior and posterior distributions for the coefficient of quest (i.e., the log OR) as shown below.

Figure. Prior (dashed red) and posterior (full black) distribution for the quest coefficient.
Modify the R script that you completed in Question 1. Run the same Bayesian logistic model, but this time use a prior distribution for the quest coefficient that would give equal probability, on the log odds scale, to odds ratios ranging from 0.05 to 20. With such a distribution, you are explicitly saying that, a priori, you do not think the odds ratio could be very extreme, but that all intermediate values are possible. Hint: only the prior distribution needs to be modified, in two places: 1) when creating your R objects and 2) in the model’s text file. How are your results modified?

Figure. Prior distribution for betaquest.
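The bounds for such a prior follow directly from the log transform: a uniform on the log odds between log(0.05) and log(20) gives equal prior probability to all ORs in that range. A quick check of the arithmetic (Python, purely for the numbers):

```python
import math

# A flat prior on betaquest covering ORs from 0.05 to 20 corresponds to
# dunif(log(0.05), log(20)) on the log odds scale.
lower = math.log(0.05)   # ~ -2.996
upper = math.log(20)     # ~ +2.996
```

The symmetry (0.05 = 1/20) is why the two bounds have the same magnitude.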
We can now run the model…
## 50% 2.5% 97.5%
## int -1.361 -1.527 -1.198
## betaquest -0.761 -1.004 -0.526
## ORquest 0.467 0.366 0.591
Using that prior, after rounding off, the estimated odds ratio is still more or less the same. The OR (95CI) is 0.47 (0.37, 0.59). Actually, there is still very little information in that prior (i.e., we are just excluding ORs below 0.05 and above 20 from our prior beliefs). On top of that, we have a relatively large dataset (n=2,161), so the data have a lot of weight compared to the prior distribution.
If we plot the prior and posterior distributions for betaquest, we get:

Figure. Prior (dashed red) and posterior (full black) distribution for the quest coefficient.
Modify the R script that you completed for Question 1 or Question 2. Let’s go crazy a bit. For instance, you could believe that knowledge does not prevent the disease, but rather increases the odds of cysticercosis. Let’s say you are very convinced, a priori, that the most probable value for the \(OR\) is 2.0, with relatively little variance in that distribution. Pick a distribution that would represent that prior knowledge. How are your results modified?

I chose to use a dnorm(0.69, 5) as prior for the coefficient of quest. So, a normal distribution with a mean corresponding to an OR of 2.0, and a variance on the log odds scale of 0.2 (or a SD of 0.45).
This distribution looks like this:

Figure. Prior distribution with mean OR of 2.0 and a variance of 0.2.
Figure. Normal distribution with mean of 0.69 and a variance of 0.20 (full line) vs. unif(-2.996, 2.996) distribution.
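One detail worth remembering here: OpenBUGS’s dnorm is parameterized by mean and precision (the inverse of the variance), so dnorm(0.69, 5) encodes a mean log-OR of log(2) ≈ 0.69 with variance 0.2. A quick check of those conversions (Python, just for the arithmetic):

```python
import math

mean_log_or = math.log(2)   # ~ 0.69, i.e., a most probable OR of 2.0
variance = 0.2
precision = 1 / variance    # what dnorm() expects as 2nd argument: 5
sd = math.sqrt(variance)    # ~ 0.45
```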
When we run this model, we get the following results (the Markov chains looked appropriate):
## 50% 2.5% 97.5%
## int -1.410 -1.575 -1.249
## betaquest -0.659 -0.893 -0.426
## ORquest 0.517 0.409 0.653
Using that prior, I get an OR (95CI) of 0.52 (0.41, 0.65). So the estimated odds ratio is now somewhere between our prior knowledge and the frequentist estimate, although still quite close to the frequentist estimate (see next figure). In this case, even with the substantial information contained in our prior distribution, the estimation process is only slightly affected by our prior knowledge. So the dataset size (n=2,161) is still driving most of the estimation process.
We see this quite well when we plot this very informative prior distribution along with the posterior distribution obtained, vs. the posterior distribution of the model where we used a vague prior distribution:

Figure. Informative prior distribution with mean of 0.69 and a variance of 0.20 (red dashed) vs. the obtained posterior distribution (full black) and the posterior distribution obtained using vague priors (blue dashed).
Start from your previously completed R scripts. To appraise the effect of priors vs. sample size, work on similar analyses, but this time use the cyst_bias_study_reduced.csv data set, which only has 60 observations (compared to 2,161 observations for the complete data set). Run a logistic regression model of the effect of quest on test, using vague prior distributions. Then run the same model, but with the prior distributions that you used in Question 3. Is the effect of the informative priors more important now?

If we explore the crude association in this smaller (n=60) data set using a frequentist approach:
## Outcome + Outcome - Total Inc risk * Odds
## Exposed + 5 33 38 13.2 0.152
## Exposed - 4 18 22 18.2 0.222
## Total 9 51 60 15.0 0.176
##
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio 0.72 (0.22, 2.42)
## Odds ratio 0.68 (0.16, 2.86)
## Attrib risk * -5.02 (-24.40, 14.35)
## Attrib risk in population * -3.18 (-21.66, 15.29)
## Attrib fraction in exposed (%) -38.18 (-361.26, 58.60)
## Attrib fraction in population (%) -21.21 (-137.09, 38.03)
## -------------------------------------------------------------------
## Test that OR = 1: chi2(1) = 0.276 Pr>chi2 = 0.60
## Wald confidence limits
## CI: confidence interval
## * Outcomes per 100 population units
We observed a frequentist OR (95CI) of 0.68 (0.16, 2.9).
If we ran a Bayesian model with vague priors (the Markov chains looked OK), we get:
## 50% 2.5% 97.5%
## int -1.583 -2.839 -0.554
## betaquest -0.349 -1.882 1.120
## ORquest 0.705 0.152 3.065
An OR (95CI) of 0.71 (0.15, 3.1), quite close to the frequentist estimates.
Now, with a Bayesian model with informative priors in the form of a dnorm(0.69, 5) (again, the Markov chains looked OK), we get:
## 50% 2.5% 97.5%
## int -2.056 -3.041 -1.220
## betaquest 0.412 -0.339 1.203
## ORquest 1.510 0.713 3.332
An OR (95CI) of 1.5 (0.71, 3.3)… We can see that the results are now strongly affected by the choice of prior. We now have a lot of information in our prior beliefs, compared to the information in the data set. Below, I plotted the informative prior distribution and the resulting posterior distribution, as well as the posterior distribution that we obtained using the vague priors. It illustrates quite well the impact of using informative priors with a small data set.
Figure. Prior informative distribution with mean of 0.69 and a variance of 0.20 (red dashed) vs. the obtained posterior distribution (full black) and the posterior distribution obtained using vague priors (full blue).
Start from the partially completed Question 1.R R script located in the Question 1 folder. Try to adjust the OR obtained from a conventional logistic model using Lash’s method. Use vague priors for the intercept and the coefficient of your logistic model. Assume that you are not 100% sure about the values that should be used for your bias parameters, so use distributions for these, rather than exact values. For instance, we could assume that:

How do these results compare to the one obtained with the QBA analyses?
The Markov chains looked OK. The median adjusted OR is 0.26 (virtually unchanged as compared to the QBA analysis), but the 95%CI is now 0.16 to 0.39, so slightly wider. This is expected, because we are now not only considering uncertainty due to sample size, but also uncertainty in our bias parameters. Below, I plotted the posterior distributions for the adjusted (full black) and biased (red dashed) OR.
Figure. Posterior distributions for the adjusted (full black) and biased (red dashed) OR.
Modify the R script that you completed in Question 1. To investigate the effect of the imprecision in your bias parameters, change the normal distribution for the log of the unmeasured confounder-disease OR. Keep a mean of 1.6, but use a larger variance, for instance 1.0 (i.e., an inverse variance of 1.0). How is your CI affected?

Again, the Markov chains are fine. The median adjusted OR (95CI) is now 0.27 (0.14, 0.55). So the median estimate has not changed much, but the 95CI is wider. This is well visible in the following figure, where I have plotted the posterior distributions of the biased OR (red dashed), of the adjusted OR assuming a variance of 0.2 for the unmeasured confounder-disease OR (blue), as well as the one assuming a variance of 1.0 for this latter parameter (black).
Figure. Posterior distributions for the adjusted (black and blue) and biased (red dashed) OR.
Start from the partially completed R script located in the Question 3 folder. Re-investigate this bias analysis using the alternative model proposed by McCandless. Use the same bias parameter distributions proposed in Question 1. Do you get similar results? Note that this latter model could easily be expanded to add effect measure modification (i.e., an interaction term between exposure and confounder). Using that expanded model, you could then report the effect of exposure by strata of the confounder.

When running this model, there were some issues with the Markov chains. It took almost 800 iterations to reach convergence, and the autocorrelation of the Markov chains is maintained for iterations that are many steps apart. This shows in the trace plots, where we can still see the chains cycling up and down slightly (though they are moving in a relatively narrow space after 800 iterations). It also shows in the ESS, which is relatively small. This unwanted feature could be overcome by running the chains for a larger number of iterations.
Figure. Diagnostic plots.
If it was for a publication (vs. our exercise!), we could also investigate whether using informative priors for all unknown parameters would help. From experience, latent class models are more prone to these kinds of issues. For now, I ran the chains for 15,000 iterations with a burn-in of 1,000. With that, the ESS was > 1,100 for all parameters.
Using McCandless’s method, we obtained a median adjusted OR (95CI) of 0.23 (0.12, 0.38), which is very close to the one obtained with Lash’s adjustment using the same bias parameters (in Question 1). To illustrate this, I plotted the posterior distributions for the adjusted OR using McCandless (black) vs. Lash (blue) vs. the biased OR (red dashed).
Figure. Posterior distributions for the adjusted (black and blue) and biased (red dashed) OR.
Start from the partially completed Question 1.R R script located in the Question 1 folder. Try to adjust the OR obtained from a conventional logistic model using the equation provided during the lecture on selection bias. Use vague priors for the intercept and the coefficient of your logistic model. As a first step, use a deterministic approach for the bias parameters (i.e., exact values) rather than prior distributions. How do these results compare to those of exercise #5? What is going on?

The adjusted and biased OR are exactly the same: 0.47 (95CI: 0.37, 0.60). There is no bias because \(OR(sel)\) is:
\(OR(sel) = \frac{0.25*0.30}{0.75*0.10} = \frac{0.075}{0.075} = 1\)
Thus, the biased OR has to be multiplied by 1 to obtain the adjusted OR (i.e., with these selection probabilities, there is no selection bias).
Modify the R script that you have completed for Question 1. Now assume that the selection probability among diseased unexposed is also 0.75. For now, stick with a deterministic approach (i.e., exact values) rather than prior distributions for the bias parameters. Is there a selection bias in that case?

There will very likely be a selection bias in that case, since:
\(OR(sel) = \frac{0.75*0.30}{0.75*0.10} = \frac{0.225}{0.075} = 3.0\)
The adjusted OR was 1.4 (95CI: 1.1, 1.8) while the biased OR was 0.47 (95CI: 0.37, 0.60). Below is a plot of the biased vs. adjusted OR.
When looking at the figure, it gives the impression that the posterior distribution for the adjusted OR is wider. This should not be the case, since we used a deterministic approach. It is simply an artifact of the scale on which the OR is presented. If we compare the biased and adjusted associations on a log(odds) scale (see below), there is no difference anymore.
Figure. Posterior distributions for the biased (red dashed) vs. adjusted (black) OR.
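The \(OR(sel)\) arithmetic from Questions 1 and 2 can be checked in a couple of lines. The mapping of each selection probability to its cell of the 2x2 table is as given in the lecture; here I simply reproduce the computation (in Python, since it is plain arithmetic; the biased OR of 0.47 is from the text above):

```python
def adjusted_or(biased_or, or_sel):
    # The biased OR is multiplied by OR(sel) to obtain the adjusted OR
    return biased_or * or_sel

# Question 1: OR(sel) = (0.25*0.30) / (0.75*0.10) = 1 -> no selection bias
or_sel_q1 = (0.25 * 0.30) / (0.75 * 0.10)

# Question 2: the 0.25 selection probability becomes 0.75 -> OR(sel) = 3
or_sel_q2 = (0.75 * 0.30) / (0.75 * 0.10)
```

Multiplying the biased OR of 0.47 by an \(OR(sel)\) of 3 gives roughly 1.4, matching the adjusted estimate reported for Question 2.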
Modify the R script that you have completed for Question 2. Use these same bias parameters, but now propose distributions for them. For instance, you could use the proposed selection probabilities as the modes of the distributions, and have the 5th or 95th percentiles fixed at -10 or +10 percentage-points from that mode. How do your results differ from those of Question 2?

The adjusted OR is now 1.3 (95CI: 0.55, 3.6). So the median adjusted OR is almost unchanged, but the 95CI is wider, as expected. This is because we are now considering uncertainty around the bias parameters in addition to the uncertainty due to the sample size. Below, I have plotted this adjusted OR along with those from Question 2.
The adjusted OR is now 1.1 (95CI: 0.29, 6.4). So the median adjusted OR is slightly changed, and the 95CI is even wider. Below, I have plotted this adjusted OR along with those from Questions 2 and 3: biased (red dashed), adjusted stochastic with wide (black) vs. narrower (blue) prior distributions for the bias parameters, and adjusted deterministic OR (green).
1. Start from the partially completed Question 1.R R script located in the Question 1 folder. Develop a latent class model that would provide an OR adjusted for outcome misclassification. As a first analysis, use point estimates for the Se and the Sp (i.e., use a deterministic approach). Try the value 1.0 for Se and Sp. Thus, you would now have a model that explicitly assumes that the Se and Sp of the diagnostic test are perfect. How do these results differ from those of the conventional logistic model?
As expected, we get an OR (95% CI) of 0.47 (0.37, 0.59), which is the same as the conventional logistic model (which implicitly states that Se and Sp are perfect). However, for the first time, we can observe Markov chains that are slightly less than perfect. For instance, we now see a bit of autocorrelation (see figure below). Nothing dramatic, though; these are still very acceptable. The ESS ranged from 981 to 1316, depending on the parameter.
Figure. Diagnostic plots.
Why is it happening? When using the value 1.0 for the test’s Se and Sp, the model should be mathematically equivalent to a conventional logistic model (i.e., a model with just a response part). The model is indeed the same, but the Markov chains are not. This is because of the way OpenBUGS interprets the code to automatically construct the chains. OpenBUGS does not recognize the redundancies in the code when Se=1 and Sp=1, and therefore comes up with unnecessarily complex chains.
2. Modify the R script that you have completed for Question 1. Stick with the deterministic approach, but now use Se=0.90 and Sp=0.97. Are the results similar to those of exercise #5 from the QBA part of the course?
We get an OR (95CI) of 0.39 (0.28, 0.53). Very similar to what we had using episensr. We still have the autocorrelation issue with the Markov chains. Actually, the Markov chains also took more time (approx. 300 iterations) to reach convergence.
3. Modify the R script that you have completed for Question 2. Now use a probabilistic approach where distributions are used to describe your knowledge about the Se and Sp of the diagnostic test. For instance, in the literature you may have found that, for the Se, a distribution with mode=0.90 and 5th percentile=0.80 would be realistic. Similarly, the Sp could be described with a distribution with mode=0.97 and 5th percentile=0.90. What is your prediction (a priori) regarding your results? You may have to use informative priors for your unknown parameters. You will possibly need a long burn-in period and many iterations, etc.
If you used vague priors for your unknown parameters, the first thing you will notice is that the Markov chains are awful! I kept the Monte Carlo simulation running for 20,000 iterations and convergence was still not obtained. Actually, when the latent class model is run with distributions for the Se and Sp estimates and vague priors for our intercept and predictor, a single solution cannot be identified anymore. Hence the Markov chains never settled into a definite region of the parameter space, representative of the posterior distribution.
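To turn a mode and a 5th percentile into an actual prior, one option is a Beta distribution (the epiR package offers `epi.betabuster()` for this). Below is a self-contained sketch of the same idea; `beta_from_mode_p05` is a hypothetical helper, not part of the course scripts:

```r
# Find a Beta(a, b) prior with a given mode whose 5th percentile matches a
# target, by scanning over the first shape parameter (hypothetical helper)
beta_from_mode_p05 <- function(mode, p05) {
  best <- NULL
  for (a in seq(1.1, 200, by = 0.01)) {
    b <- (a - 1) / mode - a + 2            # from mode = (a - 1) / (a + b - 2)
    err <- abs(qbeta(0.05, a, b) - p05)
    if (is.null(best) || err < best$err) best <- list(a = a, b = b, err = err)
  }
  best
}
se_prior <- beta_from_mode_p05(0.90, 0.80)   # shapes for a dbeta(a, b) Se prior
sp_prior <- beta_from_mode_p05(0.97, 0.90)   # shapes for the Sp prior
```

The resulting shape parameters can then be plugged into dbeta() nodes in the OpenBUGS model code.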
For that model to work, you will need to commit yourself and add some prior knowledge regarding your unknown parameters. For this model, I tried different approaches. In the end, I chose to use a dunif(-2.0, 2.0) prior for the intercept, as well as for the coefficient of quest. Thus, my prior beliefs are that:
- the OR would not be smaller than 0.14 or greater than 7.4.
- the prevalence of disease in unexposed is > 0.12 and < 0.88.
These are moderately informative priors, but remember that they will be overridden by our large data set. These priors seemed sufficient for improving our Markov chains. Actually, dunif(-3.0, 3.0) priors also improved the Markov chains a lot and could, perhaps, be more acceptable. They would correspond to ORs between 0.05 and 20.0 and to a prevalence of disease in the unexposed between 0.05 and 0.95.
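A quick way to sanity-check what such priors mean on the interpretable scales (using the bounds discussed above):

```r
# dunif(-2, 2) on the log-odds scale, translated back
round(exp(c(-2, 2)), 2)      # coefficient prior -> OR between 0.14 and 7.39
round(plogis(c(-2, 2)), 2)   # intercept prior -> prevalence between 0.12 and 0.88
round(exp(c(-3, 3)), 2)      # the wider dunif(-3, 3) -> OR between 0.05 and 20.09
```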
Moreover, I also ran the Markov chains for 15,000 iterations and applied a burn in period of 5,000. Below are the diagnostic plots with these parameters. The ESS ranged from 613 to 696, depending on the parameter. We could run it longer to generate a larger ESS if it was for publication.
The OR is slightly modified at 0.32 (note that 0.32 vs. 0.39 is not a huge difference for an OR), but the 95CI is wider at 0.15 to 0.51. Below, I have plotted the unadjusted OR (red dashed), along with the OR adjusted using a deterministic (blue) vs. stochastic (black) approach.
Why was the median OR estimate modified in that case? In most other exercises, when we switched from a deterministic to a probabilistic approach, the 95CI was enlarged, but the OR point estimate was the same…
When a disease is relatively uncommon, a loss of specificity of a few percentage points may yield a large bias, because a large number of healthy individuals will be misclassified as sick. For instance, with 10 truly sick and 100 truly healthy individuals, a five-percentage-point drop in Se will lead to misclassification of 0.5 individual (from truly sick to diagnosed as healthy). In comparison, with the same drop in Sp we would have 5 truly healthy individuals diagnosed as sick. By specifying a Sp distribution with a mode of 0.97, but especially with a 5th percentile of 0.90, which allows the Sp to be lower than 0.97, we increase the possibility for a larger bias.
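The arithmetic in the paragraph above, written out (the 10/100 split is the illustrative example, not the exercise data):

```r
n_sick <- 10; n_healthy <- 100
se_drop <- 0.05; sp_drop <- 0.05
missed_sick <- n_sick * se_drop         # 0.5 sick individual misclassified as healthy
extra_false_pos <- n_healthy * sp_drop  # 5 healthy individuals misclassified as sick
c(missed_sick = missed_sick, extra_false_pos = extra_false_pos)
```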
4. Start from the partially completed Question 4.R R script located in Question 4 folder. Using the same data, develop a model that would allow for differential outcome misclassification (i.e., different Se and Sp values in exposed and unexposed). First, start with a deterministic approach. For instance, use Se=0.85 and Sp=0.99 in unexposed and Se=0.95 and Sp=0.95 in exposed.
This time the Markov chains are appropriate. I got an OR (95CI) of 0.22 (0.15, 0.32).
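The differential structure amounts to indexing Se and Sp by exposure status. A minimal sketch with the deterministic values from the question (the true probability of 0.20 is an arbitrary illustration, not exercise data):

```r
Se <- c(unexposed = 0.85, exposed = 0.95)
Sp <- c(unexposed = 0.99, exposed = 0.95)
# Apparent (test-based) probability of disease given the true one, with the
# bias parameters picked according to exposure status
p_apparent <- function(p_true, exposed) {
  i <- exposed + 1
  Se[[i]] * p_true + (1 - Sp[[i]]) * (1 - p_true)
}
p_apparent(0.20, exposed = 0)   # 0.178 in the unexposed
p_apparent(0.20, exposed = 1)   # 0.23 in the exposed
```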
5. Modify the R script that you have completed for Question 4. Now use distributions for your bias parameters. Use the previously proposed values as modes for the distributions, and mode minus 5 percentage-points as 5th percentiles. You will probably have to use informative priors for the quest coefficient. You will possibly have to have a long burn in period and run for many iterations, etc…
Again, here I used informative priors: dUnif(-2.0, 2.0) for the intercept, as well as for the coefficient of quest. And I had to “fine tune” the MCMC analysis: 15,000 iterations, and a burn in period of 5,000 to get decent ESS. With that, I had a median OR of 0.23 (95CI: 0.14, 0.44). Below I plotted the unadjusted OR (red dashed), along with the OR adjusted for differential misclassification using a deterministic (blue) vs. stochastic (black) approach.
1. Start from the partially completed Question 1.R R script located in Question 1 folder. Using a Bayesian model, compute an OR adjusted for exposure misclassification. First use a deterministic approach (using the bias parameter values proposed).

The Markov chains are nice. With this deterministic approach we get an OR (95CI) of 0.31 (0.22, 0.44).
2. Modify the R script that you have completed for Question 1. Now use distributions for the bias parameters, using the values proposed as modes and mode minus 5 percentage-points as the 5th percentiles.

Autocorrelation was present and, with 5,000 iterations, I had an ESS ranging from 217 (for the intercept) to 419 (for the OR). I increased the number of iterations and then had ESS ranging from 547 (for the intercept) to 1002 (for the OR). If it was for publication, we could certainly let it run more to get a larger ESS.
With the probabilistic approach, I got an OR of 0.30 (95CI: 0.17, 0.45).
Below, I plotted the posterior distributions for the adjusted OR estimated using the deterministic (blue) and stochastic (black) approaches. Note that they are not very different, since we used very narrow distributions with the stochastic approach for the Se and Sp of the test used for measuring the exposure.
3. Modify the R script that you have completed for Question 1. We will now explore the potential impact of differential misclassification of quest. Suppose that the Se of the test to measure knowledge about cysticercosis is exactly (deterministic approach) 0.90 among test+ individuals and 0.70 among test-, and that the Sp is 0.95 among test+ and 0.85 among test-. Is the misclassification bias larger or smaller than in #1?

With this differential bias, we now have an adjusted OR (95CI) of 0.14 (0.09, 0.21). So the differential exposure misclassification was creating a larger bias (from an adjusted OR of 0.14 to a biased OR of 0.47) than the non-differential one (from an adjusted OR of 0.31 to the same biased OR of 0.47).
4. Modify the R script that you have completed for Question 3. Now include some uncertainty around the estimates provided in Question 3, by adding 5th percentiles located 10 percentage-points from the proposed values (e.g., mode at 0.90, 5th percentile at 0.80). What has changed?

With these parameters, the Markov chains were not converging. I tried with dunif(-2.0, 2.0) priors for the intercept and coefficient (we used them before for another LCM). I also specified a dunif(0.30, 0.70) prior for the prevalence of truly exposed individuals, and ran it for 10,000 iterations. With these parameters, the Markov chains were improved, but the ESS were still small. I am still working on it!
1. Start from the partially completed Question 1.R R script located in Question 1 folder. Using a Bayesian model, compute an OR adjusted for unmeasured confounding, misclassification of exposure, and selection bias. In this case, in which order will you have to organize your bias adjustments? First use a deterministic approach (using the bias parameter values proposed, but ignoring the provided 5th and 95th percentiles or any other measures of spread of the distributions).

We should start by controlling exposure misclassification, then selection bias (I used the adjustment proposed by Lash to make sure it would be applied after the adjustment for misclassification), and, finally, confounding.
After adjusting for exposure misclassification, I get an OR (95CI) of 0.31 (0.22, 0.44). Adding adjustment for selection bias, it is still 0.31 (0.22, 0.44). This is expected, since we already determined that, with the bias parameters used, there was no selection bias (\(OR(sel)=1.0\)). Finally, when adding adjustment for the unmeasured confounder, the OR was 0.17 (0.12, 0.23).
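The selection-bias step follows the Lash-style correction: divide the (misclassification-adjusted) OR by the selection OR built from the four selection probabilities. A sketch with made-up selection probabilities, chosen so that the selection OR equals 1.0 as in this exercise:

```r
# Hypothetical selection probabilities (case/control x exposed/unexposed);
# not the values provided in the exercise
s_case_exp <- 0.8; s_case_unexp <- 0.8
s_ctl_exp  <- 0.6; s_ctl_unexp  <- 0.6
or_sel <- (s_case_exp * s_ctl_unexp) / (s_case_unexp * s_ctl_exp)
or_adj <- 0.31 / or_sel   # misclassification-adjusted OR divided by OR(sel)
c(or_sel = or_sel, or_adj = or_adj)   # a selection OR of 1.0 leaves the OR at 0.31
```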
Below, I have plotted the posterior distribution for the unadjusted OR as a red dashed line. The exposure misclassification-adjusted OR is the purple line. So the misclassification bias was pulling the OR toward higher values than the true value. We cannot see it on the figure, but a blue line is sitting exactly underneath the purple line, this is the selection bias + exposure misclassification adjusted OR (no selection bias here). Finally, the black line represents the posterior distribution for the confounding + selection bias + exposure misclassification adjusted OR. It seems that the unmeasured confounder was also biasing the OR estimate toward higher values.
Figure. OR biased (red dashed) and adjusted posterior distributions estimated using a deterministic approach (purple = exposure misclassification-adjusted, black = confounding + selection bias + exposure misclassification-adjusted).
Given that there is no selection bias (i.e., the selection odds ratio = 1.0), I would suggest, for simplicity, removing the adjustment for selection bias if this analysis was for publication.
2. Modify the R script that you have completed for Question 1. Now use a stochastic approach, using distributions for all bias parameters.

I had to fine-tune the MCMC analysis. Autocorrelation was present and the ESS were relatively small for the intercept and the coefficient. I ran the model for 15,000 iterations and kept the burn-in at 1,000. With these simulation parameters, the Markov chains looked better and I had ESS ranging from 744 (for the intercept) to 1396 (for the coefficient). We could certainly run it for more iterations if it was for publication.
After adjusting for exposure misclassification, I get an OR of 0.30 (95CI: 0.18, 0.44). Adding adjustment for selection bias, the point estimate is still 0.29, but with a 95CI of 0.14 to 0.60. The 95CI changed because we added imprecision around the bias parameters for selection bias (thus, the selection bias OR could now be different from 1.0 on some iterations, where the selection probabilities picked did not cancel each other out). Finally, when adding adjustment for the unmeasured confounder, the OR was 0.16 (95CI: 0.07, 0.36).
Below, I have plotted the posterior distribution for the unadjusted OR as a red dashed line. The exposure misclassification-adjusted OR is the purple line. So the misclassification bias was pulling the OR toward higher values than the true value. The blue line is the selection bias + exposure misclassification adjusted OR; this adjustment just enlarged the distribution. Finally, the black line represents the posterior distribution for the confounding + selection bias + exposure misclassification adjusted OR. It seems that the unmeasured confounder was also biasing the OR estimate toward higher values.
Figure. OR biased (red dashed) and adjusted posterior distributions estimated using a stochastic approach (purple = exposure misclassification-adjusted, blue = selection bias + exposure misclassification-adjusted, black = confounding + selection bias + exposure misclassification-adjusted).
Special thanks to my colleagues from University of Prince Edward Island, Henrik Stryhn and Ian Dohoo, who reviewed and commented on this document and the R scripts for the exercises.
T. L. Lash, M. P. Fox, and A. K. Fink. 2009. Applying Quantitative Bias Analysis to Epidemiologic Data.
L. C. McCandless, P. Gustafson, and A. Levy. 2007. Bayesian sensitivity analysis for unmeasured confounding in observational studies.
P. McInturff, W. O. Johnson, D. Cowling, and I. A. Gardner. 2004. Modelling risk when binary outcomes are subject to error.
P. Gustafson. 2003. Measurement Error and Misclassification in Statistics and Epidemiology: Impacts and Bayesian Adjustments.